Advanced Operating Systems

In progress notes for CS6210 at Georgia Tech

OS Structures

SPIN Approach

What are we shooting for in an OS structure?

  • Thin: only mechanisms should be in the kernel, not policies (like a microkernel)

  • Low-cost access to resources, avoiding border crossings as much as possible (like DOS)

  • Flexibility for resource management, so that resource management can be easily morphed to suit the needs of an application (like a microkernel)

  • Flexibility without sacrificing protection and performance (like a monolithic kernel)

Approaches to Extensibility

Capability based (Hydra OS (Wulf ‘81))

  • Kernel Mechanisms (not policies) for resource allocation

  • Capability-based resource access. A capability is an entity that can be passed from one object to another, cannot be forged, and can be verified.

  • Resource managers were built as coarse-grained objects to reduce border-crossing overhead. Border crossing in Hydra meant passing a capability from one object to another and validating the capability on entry to a particular object. Building resource managers as coarse-grained objects limits the opportunities for customization and extensibility.

  • Hydra had good ideas: provide minimal mechanisms in the kernel and have resource managers implement the policies. But capabilities, the fundamental notion through which resources were accessed, were too heavyweight an abstraction to implement efficiently. In practice, Hydra did not fulfill its goal of extensibility.

MKernel based (e.g., Mach from CMU in 90’s)

  • Mach focused on extensibility and portability

  • Provides limited mechanisms in the microkernel, and implements all the services you’d expect from an OS as server processes that run as normal user-level processes above the kernel.

  • Mach achieved its goal of extensibility and portability. Performance took a back seat.

  • This gave microkernel-based designs bad press: an OS is supposed to be very performant, and performance was not a priority in Mach.

SPIN Approach to Extensibility

  • Keep a minimal kernel with its extensions in the same hardware address space. This avoids border crossings between the kernel and its extensions, greatly improving performance.

  • SPIN utilizes a strongly typed language as the basis of building the OS. In this way, SPIN also creates the necessary protections an OS needs. This means SPIN has a Compiler enforced modularity.

  • Data abstractions provided by the programming language such as an object, serve as containers for Logical Protection Domains. We are no longer reliant on hardware address spaces to provide protection between different services and the kernel.

  • The kernel only provides the interfaces; the LPDs are the ones that actually implement the functionality defined in the interface functions.

  • There can be several different implementations of the interface functions, kind of like how C has header files. This offers flexibility, because applications can dynamically bind different implementations of the same interface functions (see the sketch after this list).

  • SPIN makes extensions as cheap as a procedure call
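
As a rough C analogy of the header-file comparison above (all names here are hypothetical; SPIN’s actual code is Modula-3, not C): one page-allocation “interface,” two implementations, bound per application at run time, so that a call through the binding is just an indirect procedure call.

```c
#include <stdlib.h>

/* the "interface": what the header-file analogy corresponds to in C */
struct page_allocator {
    void *(*alloc_page)(void);
    void  (*free_page)(void *page);
};

/* two different implementations of the same interface */
static void *plain_alloc(void)   { return malloc(4096); }
static void *zeroed_alloc(void)  { return calloc(1, 4096); }
static void  plain_free(void *p) { free(p); }

static struct page_allocator plain  = { plain_alloc,  plain_free };
static struct page_allocator zeroed = { zeroed_alloc, plain_free };

/* an application dynamically binds whichever implementation it prefers;
   invoking it afterwards is just an indirect procedure call, not a
   border crossing */
void *get_page(const struct page_allocator *bound)
{
    return bound->alloc_page();
}
```

Calling `get_page(&plain)` or `get_page(&zeroed)` binds the caller to one implementation or the other without any change to the interface.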

Logical Protection Domains

Modula-3 is a strongly typed language with built-in safety and encapsulation mechanisms.

  • It has automatic memory management, meaning there are no memory leaks.

  • Supports a data abstraction called objects, and allows exposing only the externally visible methods of an object through generic interfaces.

  • Supports threads that run in the context of an object, and allows raising exceptions (e.g., on a memory access violation)

  • Modula-3 allows the creation of logical protection domains.

  • It allows us to get the safety of a monolithic kernel, without having to put system code in a separate hardware address space.

  • Objects that implement specific services can be at whatever granularity the system designer desires. Fine-grained protection can be done via capabilities. For example:

    • Individual hardware resources (e.g., page frame)

    • Interfaces that provide a certain functionality (e.g., page allocation module)

    • Collection of interfaces (e.g., entire virtual machine)

  • Capabilities to objects can be supported as pointers. Capabilities in this sense are much lighter weight than previously discussed in the Hydra OS.

  • Pointers in Modula-3 are type-specific, while in C, they are not.

Spin Mechanisms for Protection Domains

There are three mechanisms in SPIN to create protection domains and use them.

Create

Allows creating a logical protection domain: instantiate an object file with its contents, and export the names of the entry-point methods contained in the object so they are visible outside.

Resolve

Resolve is similar to linking two separately compiled files together: it resolves the names between two object files, the source and target logical protection domains, which are then dynamically linked and bound together. Once resolved, accessing methods inside the target domain happens at memory speed; resource sharing is as efficient as a procedure call.

Combine

Combine creates an aggregate, larger protection domain. Once the names in the source and target domains are resolved, an aggregate domain can be created whose entry points are the union of those of the source and target (or any number of) domains.

Customized OS with Spin

The logical protection domain gives us the ability to extend SPIN to include OS services and make it all part of the same hardware address space. No border crossing between services and mechanisms provided by SPIN.

Two different extensions of SPIN can live atop the same hardware. For example, while some subsystems are shared, process 2 can use a differently implemented memory manager than process 1.

Spin Mechanisms for Events

An OS needs to field events and process them. SPIN has an event-based communication model: services register event handlers with the SPIN event dispatcher, which supports 1-to-1, many-to-1, and 1-to-many mappings between events and handlers (see the sketch below).
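
A minimal sketch of the register/dispatch idea, with hypothetical names (SPIN’s real dispatcher operates on Modula-3 procedures): one event can fan out to many handlers, and registering the same function with several events gives the many-to-1 case.

```c
#include <stddef.h>

#define MAX_HANDLERS 8

typedef void (*handler_fn)(void *arg);

struct event {
    const char *name;
    handler_fn  handlers[MAX_HANDLERS];  /* 1-to-many: one event, many handlers */
    size_t      n_handlers;
};

/* a service registers its handler with the dispatcher for a given event */
int register_handler(struct event *ev, handler_fn fn)
{
    if (ev->n_handlers == MAX_HANDLERS)
        return -1;
    ev->handlers[ev->n_handlers++] = fn;
    return 0;
}

/* raising the event invokes every registered handler; registering the same
   function with several events yields the many-to-1 mapping */
void dispatch(struct event *ev, void *arg)
{
    for (size_t i = 0; i < ev->n_handlers; i++)
        ev->handlers[i](arg);
}
```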

Default Core Services in SPIN

SPIN provides interface procedures for implementing services in the OS. Think of SPIN as providing header files that can be backed by different Logical Protection Domain implementations. Once an LPD is properly instantiated, it becomes an extension of SPIN, and there is no border crossing between SPIN and the extensions.

Memory Management:

  • Physical Address: Allocate, Deallocate, Reclaim

  • Virtual Address: Allocate, Deallocate

  • Translation: Create/Destroy Address spaces, Add/Remove mapping between virtual pages and physical frames

  • Event Handlers: Page fault, access fault, bad address

CPU Scheduling:

  • The SPIN global scheduler interacts with the threads package of each application (an extension living atop SPIN, which could be an entire OS or just an app). The SPIN global scheduler decides, at a macro level, the amount of time given to a particular extension.

  • SPIN provides an abstraction called a Strand, the unit of scheduling used by SPIN’s global scheduler. The semantics of a strand are entirely defined by the extension.

  • Event handlers: Block, Unblock, Checkpoint, Resume

Conclusion

SPIN introduces a flexible and customizable system where applications can extend kernel functionality by dynamically adding code modules. These extensions allow the system to adapt its behavior to the needs of specific applications, improving performance and flexibility.

This method, however, would be considered unsafe in some instances without the language-based protection mechanisms that SPIN imposes on the OS. SPIN uses Modula-3, a type-safe programming language, to prevent unauthorized or harmful code from compromising the stability or security of the system. This ensures applications can add custom functionality without performing dangerous operations that could corrupt kernel data.

By executing kernel extensions written in Modula-3 within the same address space, SPIN avoids the overhead of traditional context switches and data copying. Fine-grained, efficient event handling and safe access to kernel data structures allow the system to achieve high performance; extensions are triggered by specific system events.

The paper presents experimental results showing that SPIN can achieve performance comparable to or better than a traditional monolithic OS. The authors demonstrate this through various benchmarks, including file system and network performance evaluations.

The SPIN OS offers a novel approach to OS design: a framework that allows safe and efficient extensibility by utilizing language-based safety mechanisms and by keeping extensions in the same hardware address space. This makes SPIN compelling, particularly in scenarios where customization and efficiency are crucial.

Exokernel Approach

The name Exokernel comes from the fact that the kernel exposes hardware explicitly to the operating system extensions living above it. In the Exokernel approach, we decouple the authorization of hardware use from the actual use.

A library OS requests a resource, the request is processed by the Exokernel, and the hardware is exposed through a secure binding: the Exokernel returns an encrypted key to the requesting library operating system. The semantics of how the resource will be used are irrelevant to the Exokernel and are up to the library OS. From then on, the library OS uses the resource by presenting its key to the Exokernel, which validates the key and grants access.

In this way, the Exokernel serves as a sort of doorman between the OS and the hardware (see the sketch below).
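
A toy sketch of the doorman protocol; the key generation, table layout, and function names are invented stand-ins, not the Exokernel’s actual API. The point is the split: authorization happens once at bind time, and later uses only validate the key.

```c
#include <stdint.h>

#define MAX_BINDINGS 64
#define KEY_SECRET   0x5DEECE66DULL   /* stand-in for real key generation */

typedef uint64_t capkey_t;  /* the "encrypted key" given to a library OS */

static struct {
    int      resource;
    capkey_t key;
    int      valid;
} bindings[MAX_BINDINGS];

/* bind: the Exokernel authorizes the request once and hands back a key;
   how the resource will be used is irrelevant to it */
capkey_t exo_bind(int resource)
{
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (!bindings[i].valid) {
            bindings[i].resource = resource;
            bindings[i].key      = ((capkey_t)resource << 32) ^ KEY_SECRET;
            bindings[i].valid    = 1;
            return bindings[i].key;
        }
    }
    return 0;   /* no free binding slots */
}

/* use: the doorman only validates the presented key before granting access */
int exo_access(capkey_t key, int resource)
{
    for (int i = 0; i < MAX_BINDINGS; i++)
        if (bindings[i].valid && bindings[i].key == key &&
            bindings[i].resource == resource)
            return 0;    /* validated: access granted */
    return -1;           /* unknown (or revoked) binding */
}
```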

Examples of Candidate Resources

TLB Entry

  • Virtual-to-physical mapping is done by the library OS.

  • The binding is presented to the Exokernel along with the encrypted key.

  • The Exokernel puts the mapping into the hardware TLB.

  • Once installed, the mapping can be used multiple times without Exokernel intervention.

Packet Filter

  • Needs to be executed by the Exokernel every time a packet arrives, on behalf of the OS.

  • Predicates for incoming packets are loaded into the kernel by the library OS. Loading is heavyweight, but once the binding has been established, packet arrivals do not incur further intervention.

Implementing Secure Bindings

There are three methods for implementing secure bindings.

  • Hardware mechanisms: resources such as a TLB entry, a physical page frame obtained from the Exokernel, or portions of the frame buffer used by the display can be bound by the Exokernel to an OS. Once the OS has an encrypted key for the resource, it can use it any time it wants.

  • Software caching, such as a “shadow” TLB: caching the hardware TLB in a software cache for each OS avoids the context-switch penalty of switching from one OS to another.

  • Downloading code into the kernel, which is functionally equivalent to SPIN extensions (e.g., the packet filter). However, the Exokernel does not have the same protection benefits as SPIN, because the code downloaded into the Exokernel is arbitrary rather than written in a safe language.

Default Core Services in Exokernel

Memory Management (in the example of a Page Fault)

  1. When a thread incurs a page fault, it is first fielded by the Exokernel.

  2. The Exokernel knows which threads belong to which OS, so then it informs the appropriate Library OS through a registered handler.

  3. The OS then services the page fault which may involve requesting a page frame from the Exokernel to host the specific page that is missing.

  4. If this happens, the library OS will ask the Exokernel for a page frame; a binding will be created and a new encrypted key returned.

  5. The library OS then presents the mapping, along with its key, to the Exokernel.

The Exokernel validates the key and, if approved, installs the mapping in the hardware TLB.

Memory Management using S-TLB

On a context switch, one of the biggest sources of performance loss is that locality is lost by the newly scheduled process. When we switch from one library OS to another, the TLB is completely wiped, which is massive overhead.

This is where the Software TLB (S-TLB) comes in. The S-TLB is a data structure in the Exokernel that represents the respective mappings for an OS.

On a TLB flush, the Exokernel dumps the TLB’s contents into the S-TLB of the OS that was previously running. On an OS context switch, it preloads the TLB from the S-TLB of the incoming OS (see the sketch below).
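
A sketch of the dump/preload idea with invented sizes and names (real S-TLB handling is architecture-specific):

```c
#include <string.h>

#define TLB_SIZE  64
#define MAX_LIBOS 8

struct tlb_entry { unsigned long vpn, pfn; int valid; };

static struct tlb_entry hw_tlb[TLB_SIZE];            /* the hardware TLB    */
static struct tlb_entry s_tlb[MAX_LIBOS][TLB_SIZE];  /* one S-TLB per OS    */

void os_context_switch(int from_os, int to_os)
{
    /* dump the outgoing OS's mappings into its S-TLB before the flush */
    memcpy(s_tlb[from_os], hw_tlb, sizeof hw_tlb);

    memset(hw_tlb, 0, sizeof hw_tlb);                /* the TLB flush */

    /* preload the incoming OS's guaranteed mappings so it does not
       restart with a completely cold TLB */
    memcpy(hw_tlb, s_tlb[to_os], sizeof hw_tlb);
}
```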

Default Core Services in Exokernel (cont.)

CPU scheduling maintains a linear vector of time slots. Time is divided into epochs T1 through Tn, each called a time quantum, with a start and end time. These time quanta represent the time allocated to the library OS’ that live atop the Exokernel. Each library OS gets to mark its time quanta in the linear vector of time slots at startup.

If an OS misbehaves, such as taking more time in its time quantum than allowed, the Exokernel will remember and it will take time off from its time quantum the next time it is scheduled. During the time quantum, the OS has complete control of the processor, and the Exokernel will not get involved unless a process that is running on behalf of the OS incurs a page fault.

Revocation of Resources

The Exokernel keeps track of which resources are allocated to which library OS’. At any given time, the Exokernel can revoke resources it has granted to any library OS, by sending a revoke call with a repossession vector detailing the resources being revoked.

The library OS is then responsible for taking corrective action to clean up and finalize tasks with the resources being revoked.

The library OS can also seed the Exokernel with autosave options for resources the Exokernel may want to revoke. For example, if the Exokernel decides to revoke page frames from a library OS, the library OS could have seeded the Exokernel ahead of time so that it dumps those page frames to disk on the OS’ behalf.

Code Usage by Exokernel (Examples)

  • Packet filter for de-multiplexing network packets

  • Run code in Exokernel on behalf of Library OS not currently scheduled (e.g. garbage collection for a specific App)

Putting it all Together

Exokernel Data Structures

The Exokernel maintains a PE data structure on behalf of each library OS. The PE data structure contains the entry points in the library OS for dealing with the different kinds of program discontinuities: exceptions, external interrupts, system calls, and addressing context. Each PE structure is unique to its library OS. This PE setup is functionally similar to the SPIN event handlers.

In addition to the PE data structure, there is also an S-TLB for each OS, holding the “guaranteed mappings.” The entry point associated with addressing context connects to the guaranteed mappings the library OS has requested from the Exokernel: only the TLB entries guaranteed to be kept on behalf of a particular OS are dumped into its S-TLB.

Downloading code into the kernel allows first-level interrupt handling on behalf of a library OS. This is particularly useful when an interrupt arrives for a library OS that is not currently scheduled: the downloaded code can handle it, rather than the currently running OS having to field an interrupt that is not its own.

Performance Results of SPIN and Exokernel

When reading research papers, absolute performance numbers are close to meaningless; you need to compare a system against its contemporaries. Performance questions always center on space (memory) and time (performance).

How much better is X approach over Y approach?

What is the code size difference between X OS vs. Y OS?

L3 Microkernel Approach

The L3 microkernel, by proof of construction, debunks the myths about the assumed low performance of the MKernel-based approach. As a recap, the MKernel approach was deemed low performance because of its constant border crossings, which ensured protection and extensibility at the supposed cost of performance.

It is possible to construct a microkernel like L3 where services are still in their own protection domains but live in the same hardware address space. Therefore, it is all about how efficient the implementation is.

Strikes Against Microkernel

Explicit costs:

  • Kernel-user switches: the cost of border crossing is too high.

  • Address-space switches: cross-protection-domain calls involve flushing the TLB every time.

  • Thread switches and inter-process communication have to be mediated by the MKernel, which involves border crossings at every point.

Implicit cost:

  • Memory effects, such as constant locality loss when moving across address spaces.

Debunking MKernel Myths with L3

L3, by proof of construction, systematically debunks all the myths listed in the “Strikes Against Microkernel” section.

User Kernel Border Crossing

The user-to-kernel border crossing cost is not as high as thought: L3 can perform this border crossing, including TLB and cache misses, in 123 processor cycles. By comparison, CMU’s Mach (to which SPIN and Exokernel both compare themselves) takes around 900 processor cycles.

Address Space Switches

In Address Space switches, the TLB is used to translate the virtual page number into a physical frame number.

On a context switch, the TLB is flushed, which is costly when switching address spaces often. However, a flush is not always necessary: the TLB may store address-space tags, so the entire TLB need not be flushed, depending on the situation.

In an architecture that supports address space tags in the TLB:

  1. When an entry is made in the TLB, the process ID (PID) is stored with the entry as an address-space tag.

  2. On address translation, the virtual page number (VPN) is split into tag and index (like normal), and the entry’s address-space tag is compared against the current PID.

  3. A hit requires both the VPN tag and the PID to match the entry.

  4. Because entries from different address spaces can coexist, there is no need to flush the TLB on a context switch (see the sketch after this list).
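
A sketch of a direct-mapped, address-space-tagged TLB lookup (a hypothetical layout for illustration; real TLBs are associative):

```c
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint64_t tag;     /* high bits of the VPN */
    uint32_t pid;     /* address-space tag: the owning process */
    uint64_t pfn;
    int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* returns the physical frame number on a hit, -1 on a miss; entries that
   belong to other address spaces simply miss instead of being flushed */
int64_t tlb_lookup(uint64_t vpn, uint32_t current_pid)
{
    uint64_t index = vpn % TLB_ENTRIES;   /* index bits of the VPN */
    uint64_t tag   = vpn / TLB_ENTRIES;   /* remaining bits form the tag */
    const struct tlb_entry *e = &tlb[index];

    if (e->valid && e->tag == tag && e->pid == current_pid)
        return (int64_t)e->pfn;           /* hit for this address space */
    return -1;                            /* miss: walk the page table */
}
```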

Liedtke, the author of the L3 Microkernel approach, suggests taking advantage of whatever the architecture offers you in regards to avoiding TLB flush. For example:

In x86 and PowerPC, segment registers are offered. Segment registers give an OS the opportunity to specify a range of virtual addresses that can be legally accessed by the currently running process. We can use segment registers to create a protection domain to bound addresses for a running process. The segment bounds are hardware enforced.

What if a protection domain is so large that it needs all of the Hardware address space?

If a protection domain is so large that it occupies the entire hardware address space, and the TLB does not support address-space tagging, then you have to do a TLB flush on a context switch.

In this case, the explicit costs we’ve been discussing become insignificant compared to the much larger implicit cost that is incurred: the cache effects of context switching between processes that each occupy the entire hardware address space are significant.

If we are switching in Small Protection Domains:

  • By taking advantage of architecture/hardware options given to you, careful construction of processes can render address space switching efficient.

If we are switching in Large Protection Domains:

  • Switching cost is insignificant relative to the implicit costs

  • The cache and TLB effects dominate and are a much more significant implicit cost.

Thread Switches and IPC

The explicit cost of a thread switch, in which the microkernel starts running another thread, involves saving all the volatile state of the processor. By proof of construction, L3 shows this cost to be competitive with the SPIN and Exokernel approaches.

Memory effects

Memory effects can be mitigated by carefully structuring the protection domains in the hardware address space. When the protection domains are small, Liedtke suggests packing them all into the same hardware address space and enforcing protection between processes using segment registers. In this way, the caches stay warm even across context switches.

In debunking the myths regarding address space switches, we also have strategies to mitigate memory effect issues as well.

Why was border crossing in Mach so expensive?

Mach’s significant focus on portability made it architecture independent, allowing the Mach microkernel to run on several different processor architectures. This meant code bloat in Mach, because it had to be compatible with every architecture it aimed to run on. Specifically, Mach contains architecture-independent and architecture-specific parts of the microkernel.

As a result, Mach has a large memory footprint, hence less locality, hence more cache misses, and therefore longer latency for border crossings.

Portability and performance contradict each other.

Thesis of L3 on OS Structuring

The principles of L3 are as follows:

  • The microkernel should provide only minimal abstractions: support for threads, address spaces, inter-process communication, and generating unique IDs. We need these four because ANY subsystem in an OS relies on them.

  • Microkernels are processor-specific in implementation. Avoid portability if you want an efficient implementation of a microkernel. You need to exploit anything a certain architecture offers you.

With the right set of microkernel abstractions and a processor-specific implementation, you can build efficient, processor-independent abstractions at higher layers.

Papers

Brian Bershad et al., "Extensibility, Safety and Performance in the SPIN Operating System", Proceedings of the 15th ACM Symposium on Operating System Principles, December 1995.

Dawson R. Engler, Frans Kaashoek and James O'Toole, "Exokernel: An Operating System Architecture for Application-Level Resource Management", Proceedings of the 15th ACM Symposium on Operating System Principles, ACM, December 1995.

J. Liedtke, "On Micro-Kernel Construction", Proceedings of the 15th ACM Symposium on Operating System Principles, ACM, December 1995.

J. Liedtke, "Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces", GMD Technical Report No. 933, November 1995 (self-study).

Virtualization

Introduction to Virtualization

Platform Virtualization

Platform virtualization treats the system as a black box running atop a virtual platform. The end user is satisfied as long as the virtual platform provides the functionality the system appears to have.

Utility Computing

Multiple apps and OS’ can run atop shared hardware resources. Running different apps on shared hardware lowers overall server cost, since their resource demands usually do not peak at the same time.

Utility computing is promoted by data centers worldwide; it lets end users access extra computing resources during their peak times instead of purchasing their own hardware.

Hypervisors

How is it possible that multiple different OS’ can run atop the same shared hardware resources? Who decides who gets what? A hypervisor is like an OS of OS’.

There are native (bare-metal) hypervisors, which run directly atop the shared hardware, with the OS’ running atop the hypervisor.

The second type is the hosted hypervisor, which runs atop a host OS and allows the user to emulate the functionality of other OS’.

Connecting the Dots

  • The concept of virtualization started in the 70s with IBM’s VM/370. The intent was to give every user on a computer the illusion of owning the computer.

  • Microkernels were introduced in the late 80’s and 90’s

  • Extensibility of OS was investigated in the 90’s

  • SimOS was introduced in the late 90’s; it laid the basis for the modern resurgence of virtualization technology and was the basis for VMware.

  • Xen and VMware were based on papers laid out in the early 2000’s.

Now there is accountability for usage in data centers: companies can provide cloud virtualization services and bill users individually. Virtualization has made computing another utility, like water or electricity.

Full Virtualization

The full virtualization framework leaves the OS basically untouched, allowing you to run the unchanged binary of the OS atop the hypervisor. The OS’ run as user-level processes atop the hypervisor, which means they do not get the privileges they would have running directly on bare metal.

When the OS runs privileged instructions, those instructions trap into the hypervisor, which emulates the intended functionality on behalf of the OS.

In some architectures, privileged instructions executed at user level fail silently. The hypervisor must watch for these cases and emulate the intended behavior when such failures happen. Early on this was a significant problem, but modern architectures support virtualization natively much better.

Para Virtualization

Another approach to virtualization is to modify the source code of the guest OS. Doing this, we can avoid problematic instructions, and include optimizations between the OS and the underlying hardware.

With para virtualization, less than 2% of the guest OS code needs modification. This was proven in Xen by proof of construction.

Big Picture

Virtualize Hardware:

  • Memory Hierarchy

  • CPU

  • Devices

Effect data transfer and control transfer between the guests and the hypervisor

Memory Virtualization

Memory Hierarchy

Caches are physically tagged, so not much work is needed to virtualize them. The main struggle is virtual memory: mapping virtual addresses to physical addresses is a key functionality of the memory subsystem.

Memory Subsystem Recall

Each process is in its own protection domain, and usually, a separate hardware address space. The OS maintains a page table on behalf of each of the processes. The page table is an OS data structure that holds the mapping between the virtual page numbers and the physical pages where those virtual pages are contained in the main memory of the hardware.

Physical memory is contiguous, but a process’s virtual addresses do not point to contiguous sections of physical memory; the locations are scattered across physical memory.

Memory Management and Hypervisor

In the virtualized setup, the hypervisor sits between the guest OS and the hardware. Each process in each OS is in its own protection domain, which means each process has a distinct page table in its OS. The hypervisor does not know about these page tables stored in the OS’.

Memory Manager Zoomed Out

The real memory, called machine memory, is in the control of the hypervisor. What the guest OS sees as its “physical memory” is an illusion created by the hypervisor: it is not actually contiguous machine memory; instead, the hypervisor scatters the OS’s physical-memory regions across machine memory. The needs of an OS may change through its lifespan, and the hypervisor may need to allot more machine memory to an OS in a non-contiguous manner.

Zooming Back In

In a non-virtualized setting, the virtual page number goes through the page table and results with a physical page number.

Non-Virtualized Setting:

VPN → PT → PPN

In a virtualized setting, however, there is another level of indirection. The physical page number of the OS has to be mapped to the machine memory / page numbers (MPN). The mapping between the physical page number (PPN) and the machine page number (MPN) is kept in a shadow page table (SPT).

Virtualized Setting:

VPN → PT → PPN → S-PT → MPN

In the case of full virtualization, the guest OS does not know it is not running on bare metal, and therefore the hypervisor stores the PPN → MPN mappings.

In the case of para virtualization, the guest OS is aware it is being virtualized, and it can and usually does store the PPN → MPN mappings instead of the hypervisor.
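
The extra level of indirection as a tiny sketch, using single-level tables for illustration. In reality page tables are multi-level, and the hypervisor installs the composed VPN → MPN translation into the hardware so the two-step walk is not paid on every access:

```c
#include <stdint.h>

#define N_PAGES 1024

static uint64_t guest_pt[N_PAGES];   /* guest's page table: VPN -> PPN */
static uint64_t shadow_pt[N_PAGES];  /* hypervisor's table: PPN -> MPN */

/* the virtualized translation path: VPN -> PT -> PPN -> S-PT -> MPN */
uint64_t translate(uint64_t vpn)
{
    uint64_t ppn = guest_pt[vpn];    /* what the guest believes is physical */
    uint64_t mpn = shadow_pt[ppn];   /* where the page really lives */
    return mpn;
}
```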

Shadow Page Table

In many architectures, the CPU uses the page table for address translation. Under virtualization, the hardware page table that the CPU consults is really the shadow page table.

Efficient Mapping (Full Virtualization)

When a fully virtualized guest OS attempts to map a VPN to a PPN through its page table, that is a privileged action, which traps into the hypervisor. The hypervisor then maps the particular VPN to an entry in the S-PT, and the translation is installed into the TLB / hardware page table.

Efficient Mapping (Para Virtualization)

In a para virtualized setting, the guest OS is aware that the physical memory is non-contiguous. The burden of efficient mapping can be moved from the hypervisor to the guest OS.

In Xen (a paravirtualized hypervisor), a set of hypercalls is provided for the guest OS to tell the hypervisor about the changes made to the hardware page table.

The guest OS makes hypercalls to the hypervisor, allowing it, for example, to initialize a page frame and then ask the hypervisor to create a page table on its behalf with that initialized page frame. Other supported operations include switching between page tables and updating a page table.

Dynamically Increasing Memory

The hypervisor must be ready to handle the varying memory needs of the guest OS’ by allocating them regions of machine memory. If all the machine memory is taken up, the hypervisor can take back a portion of machine memory from another OS. Rather than forcibly reclaiming, the hypervisor can also request memory from an OS so that it voluntarily gives some up to its peers.

Ballooning

Ballooning is a technique in which a special device driver is installed in every guest OS, regardless of full or para virtualization. The balloon device driver is the key to managing memory pressure in a virtualized setting.

If the hypervisor needs more memory, it contacts a guest OS’ balloon driver through a private channel between the balloon driver and the hypervisor. The hypervisor instructs the balloon to inflate inside the guest OS, so the balloon requests more and more memory from the guest. Once the balloon has acquired the memory, it returns the corresponding physical memory to the hypervisor.

If the hypervisor needs less memory, it can tell the guest OS’ balloon driver to deflate; deflating releases memory back into the guest OS (see the sketch below).
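
A sketch of the balloon driver’s two operations, with stubbed-out guest and hypervisor hooks (the real private channel and frame hand-off are hypervisor-specific):

```c
#include <stdlib.h>

#define MAX_BALLOON_PAGES 4096

/* stand-ins for the guest kernel allocator and the private hypervisor channel */
static void *guest_alloc_page(void)              { return malloc(4096); }
static void  guest_free_page(void *p)            { free(p); }
static void  give_frame_to_hypervisor(void *p)   { (void)p; }
static void  take_frame_from_hypervisor(void *p) { (void)p; }

static void  *balloon_pages[MAX_BALLOON_PAGES];
static size_t balloon_size;

/* inflate: squeeze memory out of the guest and return the frames to the
   hypervisor for use by its peers */
void balloon_inflate(size_t n_pages)
{
    while (n_pages-- && balloon_size < MAX_BALLOON_PAGES) {
        void *page = guest_alloc_page();  /* guest decides what to give up */
        if (!page)
            break;
        balloon_pages[balloon_size++] = page;
        give_frame_to_hypervisor(page);
    }
}

/* deflate: reclaim frames from the hypervisor and release the pages
   back into the guest OS */
void balloon_deflate(size_t n_pages)
{
    while (n_pages-- && balloon_size > 0) {
        void *page = balloon_pages[--balloon_size];
        take_frame_from_hypervisor(page);
        guest_free_page(page);
    }
}
```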

Sharing Memory Across Virtual Machines

It is possible, depending on the situation, to share space in the machine memory. For example, if two VMs are running the same version of Linux with the same version of Firefox, then to avoid storing duplicate data, virtual pages of both VMs can be mapped to the same place in machine memory.

One way to do this is with cooperation between the VMs and the hypervisor, where hooks let them inform each other of the situation.

VM Oblivious Page Sharing

An alternative is for the VMs to be completely oblivious to the page sharing. The hypervisor keeps a hash table containing content hashes of the machine pages.

  1. Take the contents of a page and create a content hash representing that page, corresponding to the PPN of the specific memory in the VM.

  2. Look up the hash in the hypervisor’s hash table to see whether it matches any page currently in machine memory.

  3. A match is only a hint that the stored data is a duplicate: the VM owning the matched page could have modified it since it was hashed.

  4. So a full comparison must be made to verify that the data really is a duplicate.

  5. On a successful match, map the VM’s page to the same spot in machine memory, and increment a reference count recording how many mappings point to that machine page.

  6. Mark the entries copy-on-write: the VMs share the page as long as they only read it, but if any of them writes to it, a private copy of the original page is made and the mappings are adjusted.

  7. After this successful match process, the now-duplicate page frame can be freed in machine memory. (A sketch of this flow follows.)
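
A sketch of the hint, full compare, then share flow, using an invented table layout (VMware ESX’s real implementation differs in the details):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096
#define TABLE_SIZE 1024

struct shared_page {
    uint64_t hash;    /* content hash: only a hint */
    uint8_t *mpn;     /* machine page holding the contents */
    int      refs;    /* how many PPNs map here */
};

static struct shared_page table[TABLE_SIZE];

static uint64_t content_hash(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;  /* FNV-1a, as a stand-in */
    for (int i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 1099511628211ULL;
    return h;
}

/* returns the machine page to share (caller marks the mapping copy-on-write
   and frees the duplicate frame), or NULL if no duplicate exists */
uint8_t *try_share(uint8_t *candidate)
{
    uint64_t h = content_hash(candidate);

    for (int i = 0; i < TABLE_SIZE; i++) {   /* linear scan: a real hash table
                                                would index by h */
        if (table[i].refs > 0 && table[i].hash == h) {
            /* hash match is only a hint: do the full comparison */
            if (memcmp(table[i].mpn, candidate, PAGE_SIZE) == 0) {
                table[i].refs++;             /* one more mapping to this frame */
                return table[i].mpn;
            }
        }
    }

    /* no duplicate: record this page as a new hint for future matches */
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].refs == 0) {
            table[i].hash = h;
            table[i].mpn  = candidate;
            table[i].refs = 1;
            break;
        }
    }
    return NULL;
}
```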

Memory Allocation Policies

Different policies are in place for allocating and reclaiming memory. The goal of virtualization is maximization of the utilization of resources. Virtualized environments use different policies for memory allocation, such as:

  • Pure share-based approach: you pay less, you get less. The data center gives you an amount of resources directly correlated with the money you pay. This can lead to hoarding, where a VM acquires a lot of resources and holds onto them without using them.

  • Working-set-based approach: if a VM’s working set needs more memory, give it more; if it needs less, give it less. However, a customer who paid for a certain amount of resources may feel entitled to always get those resources.

  • Dynamic idle-adjusted shares approach: mixes both approaches by taxing idle pages more than active pages. If the resources you’ve been given are actively used, more power to you; but if you are just holding them, they are taken away, or “taxed.” Different data centers can choose different tax rates, with higher rates promoting a more “wealth-redistribution” approach and lower rates corresponding to a more pay-for-what-you-get approach.

    With a tax rate somewhere in the middle, most idle memory can be reclaimed, but because it is not taxed at 100%, sudden working-set increases can still be accommodated (see the sketch below).
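
The ESX paper (referenced below) formalizes this with an idle-adjusted shares-per-page ratio: a VM with S shares and P pages, of which fraction f are active, gets ratio ρ = S / (P(f + k(1 − f))), where k = 1/(1 − τ) makes an idle page cost k times an active one; memory is reclaimed from the VM with the lowest ratio. A one-function sketch:

```c
/* idle-adjusted shares-per-page ratio (Waldspurger, OSDI 2002):
   rho = S / (P * (f + k*(1 - f))), with k = 1/(1 - tau).
   tau = 0 is the pure share-based policy; tau -> 1 approaches the
   working-set-based policy. */
double adjusted_ratio(double shares, double pages,
                      double active_fraction, double tax_rate)
{
    double k = 1.0 / (1.0 - tax_rate);  /* an idle page costs k times more */
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)));
}
```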

CPU & Device Virtualization

We aim for the guest OS on a hypervisor to have the illusion of owning the CPU. We also want the hypervisor to field events arising due to the execution of a process that belongs to a parent guest OS.

CPU Virtualization

In each guest OS, processes are already multiplexed by the OS, even in a non-virtualized setting. The hypervisor has to give each guest OS the illusion of owning the CPU, and it needs a precise way to account for the time each guest OS is allocated on the CPU.

The hypervisor may give the guest OS’ proportional shares, depending on each VM’s service agreement with the hypervisor. There is also fair-share scheduling, which gives a fair (equal) share to each guest OS.

In either case, the hypervisor must account for time stolen from a guest OS’ share by another guest OS, as happens with external interrupts. If time is taken away during a guest OS’ slot, it is rewarded back in the future.

Events arising from a process currently executing on the CPU need to be delivered to the parent guest OS, in both the full and para virtualized settings. A process may incur a system call, page fault, exception, or external interrupt. All of these discontinuities have to be passed up by the hypervisor to the respective guest OS, delivered as software interrupts.

Some of these discontinuities require privileged access to handle in the guest OS. In a fully virtualized environment, the guest OS may attempt a privileged action, get trapped by the hypervisor, and fail silently; the hypervisor needs to account for such quirks in order to handle them.

Device Virtualization

For full virtualization, the OS’ already believe they have full ownership of the devices. In actuality, the hypervisor uses the trap-and-emulate approach.

Para virtualization has much more interesting opportunities for innovation. It is possible for the hypervisor to create clean device abstractions for the guest OS. Shared buffers can also be exposed for the guest OS. There can also be innovations in the communication methods themselves, between the guest OS and the hypervisor.

Control Transfer

For full virtualization, control transfer happens implicitly: the guest OS attempts a privileged instruction and the hypervisor traps it, while software interrupts (events) flow from the hypervisor to the guest OS.

For para virtualization, control transfer happens explicitly through hypercalls from the guest into the hypervisor, and through software interrupts (events) from the hypervisor to the guest OS. The guest controls, via hypercalls, when event notifications need to be delivered.

Data Transfer

For full virtualization, data transfer is implicit.

For para virtualization (e.g., Xen), data transfer is explicit, offering opportunities to innovate.

Xen provides asynchronous I/O rings: a data structure shared between the guest and Xen for communication. Any number of rings can be allocated to handle all the device I/O needs of a particular guest domain. The ring itself holds descriptors. The guest OS places requests in the I/O ring by populating descriptors; every descriptor is a unique I/O request with a unique ID. When Xen finishes processing a request, it places the response in the same ring, in a descriptor carrying the same unique ID.

This is a producer-consumer relationship: the guest is the request producer and writes to a shared pointer visible to Xen, the request consumer. For responses the relationship flips: Xen is the response producer writing to a shared pointer visible to the guest, the response consumer (see the sketch below).
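
A simplified sketch of one ring. Real Xen rings share a single descriptor array for requests and responses and pair it with event-channel notifications; the names and layout here are illustrative:

```c
#include <stdint.h>

#define RING_SIZE 32   /* power of two so the modulo indexing stays cheap */

struct descriptor {
    uint64_t id;       /* unique request ID; the response reuses it */
    void    *buffer;   /* pointer to a guest-owned buffer: no copying */
    int      status;   /* filled in by the response */
};

struct io_ring {
    struct descriptor ring[RING_SIZE];
    /* guest is the request producer, Xen the request consumer */
    uint32_t req_prod, req_cons;
    /* for responses, the roles flip */
    uint32_t rsp_prod, rsp_cons;
};

/* guest side: enqueue a request by filling in the next free descriptor */
int submit(struct io_ring *r, uint64_t id, void *guest_buffer)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                            /* ring full */
    struct descriptor *d = &r->ring[r->req_prod % RING_SIZE];
    d->id = id;
    d->buffer = guest_buffer;                 /* a pointer, not a copy */
    r->req_prod++;                            /* advance the shared pointer */
    return 0;
}

/* Xen side: consume one request, process it, produce the response */
void service_one(struct io_ring *r)
{
    if (r->req_cons == r->req_prod)
        return;                               /* nothing pending */
    struct descriptor *d = &r->ring[r->req_cons % RING_SIZE];
    r->req_cons++;
    d->status = 0;                            /* "processed": same slot, same ID */
    r->rsp_prod++;                            /* guest will consume the response */
}
```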

Control and Data Transfer in Action

For Xen Data Transfer in action, we’ll use Network Virtualization as an example.

Each guest has 2 I/O rings. One is used for transmission, the other is for reception. In the case of transmission:

If the guest wants to transmit packets, it will enqueue descriptors in the transmit I/O ring via hypercalls provided by Xen.

The packets to be transmitted are not copied into Xen; the buffers containing the packets remain guest OS buffers.

Pointers to those buffers are embedded in the descriptors the guest enqueues for transmission.

For the duration of the transmission, the pages associated with the network packets are page pinned.

A round robin packet scheduler is used by Xen in order to transmit packets from different VMs.

In the case of reception:

When Xen receives a network packet destined for a particular guest OS, it exchanges the received packet for one of the guest OS pages already provided to Xen as a holding place for incoming packets.

A guest OS will pre-allocate network buffers, which are pages owned by a guest OS.

Disk I/O Virtualization

Disk I/O virtualization works quite similarly to network virtualization. Every VM has an I/O ring dedicated to disk I/O. The communication strives to avoid copying into Xen: by enqueueing pointers to pre-allocated guest OS buffers, copying is avoided.

Xen may reorder requests from guest domains to make I/O throughput efficient. In situations where such reordering is inappropriate given the semantics of the I/O operations, Xen provides a reorder barrier that lets a guest OS enforce that operations are performed in the order they were issued.

Measuring Time

Virtualized environments keep accurate measures of space and time usage for CPU, memory, storage, and network, so that providers can bill their clients appropriately.

Xen and Guests

Papers

Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield, "Xen and the Art of Virtualization", SOSP 2003.

Carl Waldspurger, "Memory Resource Management in VMware ESX Server", OSDI 2002.

Parallel Computing

Shared Memory Machines

All three shared-memory machine models (dance hall, SMP, DSM) have CPUs, an interconnection network, and memory.

In dance hall architecture, you have CPUs on the opposite side of memory that are connected via an interconnection network.

In SMP (symmetric multiprocessing), each CPU takes the same amount of time to access memory.

In DSM (distributed shared memory), each CPU has its own memory, and can still access every other CPU’s memory through the interconnection network.

Shared Memory and Caches

When a CPU wants to access a memory location, it first checks its cache; if the location is not there, it fetches it from main memory and puts it into the cache.

In a multiprocessor, it works the same way. However, if a value stored in more than one cache is changed, how do we update it across all the caches? This is the cache coherence problem. Who should ensure this consistency?

This requires an agreement between hardware and software, called a memory consistency model.

Memory Consistency Model

Across two different processes, there is no guarantee about when the steps of one run relative to the steps of the other.

The sequential consistency model maintains program order within each process plus arbitrary interleaving across processes, like a card dealer’s merge shuffle in a casino.

Hardware Cache Coherence

Suppose a processor writes to a memory location that is also present in other processors’ caches. In that case, the hardware will invalidate all the other cached copies of that location. This is called write-invalidate.

In the write-update scheme, the hardware instead updates the other cached copies rather than invalidating them.

The goal is to keep coherence overhead to a minimum; that overhead continues to grow as more CPUs are added and the system grows.

Scalability

There is overhead that increases as more processors are added.

To optimize performance, minimize sharing of memory across threads as much as possible.

Synchronization

Summary

A lock protects data so that operations on it can only be performed by one particular process at a time. An exclusive lock ensures that only one process can access the data at a time; a shared lock allows simultaneous access, for example when multiple readers read the same data.

Synchronization Primitives

Barrier synchronization ensures that all threads need to catch up to the barrier before they all continue.

Atomic Operations

We need read-modify-write (RMW) semantics for atomic instructions, e.g., test-and-set or fetch-and-increment.

Scalability Issues with Synchronization

  1. Latency: how long it takes a thread to acquire the lock when the lock is free.

  2. Waiting Time: If a thread wants to acquire a lock, how long does it need to wait before it has its turn with the data?

  3. Contention: when a thread releases a lock, there may be many threads waiting that must contend for access to the lock. How long does this contention take?

Naive Spinlock

The problem with the naive spinlock (spinning on test-and-set) is too much contention: every waiter hammers the shared lock variable with atomic operations, which does not exploit caches and disrupts the lock holder’s useful work.

Caching Spinlock

Here we assume that the hardware ensures cache coherence through the system bus.

The CPUs spin locally on the cached copy of the lock while waiting; when the lock is released, each waiter attempts the atomic acquire, and whoever loses goes back to spinning locally (see the sketch below).

With high contention, it can take O(n²) bus transactions, where n is the number of processors contending for the lock at a given time.
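
A C11 sketch of both flavors: the naive test-and-set lock and the spin-on-read (test-and-test-and-set) variant.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;

/* naive spinlock: every spin iteration is an atomic test-and-set, which
   generates bus traffic even while the lock is held */
void naive_lock(void)
{
    while (atomic_exchange(&lock_flag, true))
        ;
}

/* caching ("spin on read") spinlock: spin locally in the cache, and only
   attempt the atomic RMW when the lock looks free */
void caching_lock(void)
{
    for (;;) {
        while (atomic_load(&lock_flag))
            ;                               /* read the cached copy */
        if (!atomic_exchange(&lock_flag, true))
            return;                         /* won the race */
        /* lost the race: someone else grabbed it, resume local spinning */
    }
}

void unlock(void)
{
    atomic_store(&lock_flag, false);
}
```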

Spinlocks with Delay

Each processor waits for a delay that increases dynamically every time it checks the lock and finds it still held. With enough delay, this works even without spinning on a cached copy (see the sketch below).
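
The same idea with an exponential delay added (the delay constants are arbitrary):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;

/* spinlock with delay: after each failed attempt, wait before re-checking,
   and increase the delay dynamically so contenders spread out in time */
void backoff_lock(void)
{
    unsigned delay = 1;
    while (atomic_exchange(&lock_flag, true)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                      /* no memory traffic during the delay */
        if (delay < 1024)
            delay *= 2;            /* back off further on every failed check */
    }
}

void backoff_unlock(void)
{
    atomic_store(&lock_flag, false);
}
```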

Ticket Lock

With the ticket lock model, we can ensure fairness by using a queue approach. For example, at a popular deli you take a numbered ticket, so you know how many people are ahead of you and when it will be your turn to be served.

This achieves fairness but causes extra contention on the network, since every waiter must observe the now-serving counter as it increases (see the sketch below).
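
A C11 sketch of the deli analogy:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* the ticket dispenser */
    atomic_uint now_serving;   /* the display above the counter */
} ticket_lock;

void ticket_acquire(ticket_lock *l)
{
    /* one atomic RMW to take a ticket; arrival order decides lock order */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;   /* every release updates now_serving, disturbing all waiters */
}

void ticket_release(ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}
```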

Spinlock Summary

  1. Read with T+S: No fairness

  2. T+S with Delay: No fairness

  3. Ticket Lock: Fair but noisy

Array-based Queueing Lock

In this style, we get a fair and quiet (low-contention) queue for the lock. The only real downside is higher space complexity: O(N) space per lock, where N is the number of processors (see the sketch below).
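
A sketch of Anderson’s array-based queueing lock. The padding keeps each waiter spinning on its own cache line, and the O(N) space cost per lock is visible in the slots array:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N_CPUS 64

typedef struct {
    /* one flag per slot, padded to a cache line to avoid false sharing */
    struct {
        atomic_bool has_lock;
        char        pad[64 - sizeof(atomic_bool)];
    } slots[N_CPUS];
    atomic_uint next_slot;
} array_lock;

void array_lock_init(array_lock *l)
{
    for (int i = 0; i < N_CPUS; i++)
        atomic_store(&l->slots[i].has_lock, false);
    atomic_store(&l->slots[0].has_lock, true);   /* first arrival gets it */
    atomic_store(&l->next_slot, 0);
}

/* returns the slot index, which the caller passes back to release() */
unsigned array_acquire(array_lock *l)
{
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % N_CPUS;
    while (!atomic_load(&l->slots[me].has_lock))
        ;   /* each waiter spins on its own cache line only */
    return me;
}

void array_release(array_lock *l, unsigned me)
{
    atomic_store(&l->slots[me].has_lock, false);              /* reset mine */
    atomic_store(&l->slots[(me + 1) % N_CPUS].has_lock, true); /* pass it on */
}
```

This stays fair (slots are granted in arrival order) and quiet (a release touches exactly one waiter’s cache line), at the cost of N_CPUS slots per lock.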

Linked Based Queueing Lock

Anderson’s array-based queueing lock tends to be slightly faster, since linked lists carry more overhead; the linked variant trades that overhead for space proportional to the actual number of waiters rather than a fixed array (see the sketch below).
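
The linked-list variant is the MCS lock (Mellor-Crummey and Scott). A C11 sketch, in which each waiter spins on its own queue node instead of a statically sized array:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
};

typedef struct mcs_node *_Atomic mcs_lock_t;   /* points at the queue tail */

void mcs_acquire(mcs_lock_t *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    /* swap ourselves in as the new tail of the queue */
    struct mcs_node *prev = atomic_exchange(lock, me);
    if (prev != NULL) {                    /* lock is held: queue up */
        atomic_store(&me->locked, true);
        atomic_store(&prev->next, me);     /* link in behind the predecessor */
        while (atomic_load(&me->locked))
            ;                              /* spin on our own node only */
    }
}

void mcs_release(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct mcs_node *expected = me;
        /* no visible successor: if we are still the tail, free the lock */
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;
        /* someone swapped in but has not linked yet: wait for the link */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);    /* hand the lock to the successor */
}
```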

Summary

Communication

Barrier Synchronization

All threads have to wait at the barrier until every thread has reached it.

This has a problem, though. Before the last processor resets the count back to N, other processors may race ahead to the next barrier and pass through it.

We need to add another condition so that the count is properly reset before anyone proceeds. But that requires two spin loops, which is not optimal; we want one spin loop.

Sense Reversing Barrier

The sense-reversing barrier gets down to one spin loop, but there is still too much sharing in this approach: every processor spins on the single, globally shared sense flag (see the sketch below).
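
A C11 sketch of a sense-reversing barrier: one count plus a globally shared sense flag gives a single spin loop per barrier episode.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N_THREADS 8

static atomic_int  count        = N_THREADS;
static atomic_bool global_sense = true;

/* each thread keeps its own local sense flag, initialized to true */
void barrier(bool *local_sense)
{
    *local_sense = !*local_sense;              /* the sense for this episode */
    if (atomic_fetch_sub(&count, 1) == 1) {    /* last thread to arrive */
        atomic_store(&count, N_THREADS);       /* reset the count first... */
        atomic_store(&global_sense, *local_sense);  /* ...then release all */
    } else {
        while (atomic_load(&global_sense) != *local_sense)
            ;                                  /* the single spin loop */
    }
}
```

Resetting the count before flipping the sense is safe because no thread can pass the spin loop, reach the next barrier, and touch the count until the sense actually flips.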

Tree Barrier

A more scalable version is the tree barrier, which applies divide and conquer: processors synchronize in small groups, and the groups are combined up a tree.