Advanced Operating Systems

In progress notes for CS6210 at Georgia Tech

OS Structures

SPIN Approach

What are we shooting for in an OS structure?

  • Thin: only mechanisms should be in the kernel, not policies (like a microkernel)

  • Low-cost access to resources, avoiding border crossings as much as possible (like DOS)

  • Flexibility for resource management, so that resource management can be easily morphed to suit the needs of an application (like a microkernel)

  • Flexibility without sacrificing protection and performance (like a monolithic kernel)

Approaches to Extensibility

Capability based (Hydra OS (Wulf ‘81))

  • Kernel Mechanisms (not policies) for resource allocation

  • Capability-based resource access. A capability is an entity that can be passed from one object to another, cannot be forged, and can be verified.

  • Resource managers were built as coarse-grained objects to reduce border-crossing overhead. Border crossing in Hydra meant passing a capability from one object to another and validating the capability on entry to a particular object. Building resource managers as coarse-grained objects limits the opportunities for customization and extensibility.

  • Hydra had good ideas: provide minimal mechanisms in the kernel and have resource managers implement the policies. But capabilities, the fundamental notion through which resources were accessed, were too heavyweight an abstraction to implement efficiently. In practice, Hydra did not fulfill its goal of extensibility.

MKernel based (e.g., Mach from CMU in 90’s)

  • Mach focused on extensibility and portability

  • Provides limited mechanisms in the microkernel, and implements all the services you’d expect from an OS as server processes that run as normal user-level processes above the kernel.

  • Mach achieved its goal of extensibility and portability. Performance took a back seat.

  • This gave microkernel-based designs bad press: an OS is supposed to be very performant, and performance was not a priority in Mach.

SPIN Approach to Extensibility

  • Keep a minimal kernel with its extensions in the same hardware address space. This avoids border crossings between the kernel and its extensions, greatly improving performance.

  • SPIN utilizes a strongly typed language as the basis of building the OS. In this way, SPIN also creates the necessary protections an OS needs. This means SPIN has a Compiler enforced modularity.

  • Data abstractions provided by the programming language such as an object, serve as containers for Logical Protection Domains. We are no longer reliant on hardware address spaces to provide protection between different services and the kernel.

  • The kernel only provides the interfaces; the LPDs are the ones that actually implement the functionality defined in the interface functions.

  • There can be several different implementations of the interface functions, kind of like how C has header files. This offers flexibility, because applications can dynamically bind different implementations of the same interface functions (see the sketch after this list).

  • SPIN makes extensions as cheap as a procedure call
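
As a rough C analogy of the header-file comparison above (all names here are hypothetical; SPIN’s actual code is Modula-3, not C): one page-allocation “interface,” two implementations, bound per application at run time, so that a call through the binding is just an indirect procedure call.

```c
#include <stdlib.h>

/* the "interface": what the header-file analogy corresponds to in C */
struct page_allocator {
    void *(*alloc_page)(void);
    void  (*free_page)(void *page);
};

/* two different implementations of the same interface */
static void *plain_alloc(void)   { return malloc(4096); }
static void *zeroed_alloc(void)  { return calloc(1, 4096); }
static void  plain_free(void *p) { free(p); }

static struct page_allocator plain  = { plain_alloc,  plain_free };
static struct page_allocator zeroed = { zeroed_alloc, plain_free };

/* an application dynamically binds whichever implementation it prefers;
   invoking it afterwards is just an indirect procedure call, not a
   border crossing */
void *get_page(const struct page_allocator *bound)
{
    return bound->alloc_page();
}
```

Calling `get_page(&plain)` or `get_page(&zeroed)` binds the caller to one implementation or the other without any change to the interface.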

Logical Protection Domains

Modula-3 is a strongly typed language with built-in safety and encapsulation mechanisms.

  • It has automatic memory management, meaning there are no memory leaks.

  • Supports a data abstraction called objects, and allows exposing only the externally visible methods of an object through generic interfaces.

  • Supports threads that run in the context of an object, and allows raising exceptions (e.g., on a memory access violation)

  • Modula-3 allows the creation of logical protection domains.

  • It allows us to get the safety of a monolithic kernel, without having to put system code in a separate hardware address space.

  • Objects that implement specific services can be at whatever granularity the system designer desires. Fine-grained protection can be done via capabilities. For example:

    • Individual hardware resources (e.g., page frame)

    • Interfaces that provide a certain functionality (e.g., page allocation module)

    • Collection of interfaces (e.g., entire virtual machine)

  • Capabilities to objects can be supported as pointers. Capabilities in this sense are much lighter weight than previously discussed in the Hydra OS.

  • Pointers in Modula-3 are type-specific, while in C, they are not.

Spin Mechanisms for Protection Domains

There are three mechanisms in SPIN to create protection domains and use them.

Create

Allows creating a logical protection domain: instantiate an object file with its contents, and export the names of the entry-point methods contained in the object so they are visible outside.

Resolve

Resolve is similar to linking two separately compiled files together: it resolves the names between two object files, the source and target logical protection domains, which are then dynamically linked and bound together. Once resolved, accessing methods inside the target domain happens at memory speed; resource sharing is as efficient as a procedure call.

Combine

Combine creates an aggregate, larger protection domain. Once the names in the source and target domains are resolved, an aggregate domain can be created whose entry points are the union of those of the source and target (or any number of) domains.

Customized OS with Spin

The logical protection domain gives us the ability to extend SPIN to include OS services and make it all part of the same hardware address space. No border crossing between services and mechanisms provided by SPIN.

Two different extensions of SPIN can live atop the same hardware. For example, while some subsystems are shared, process 2 can use a differently implemented memory manager than process 1.

Spin Mechanisms for Events

An OS needs to field events and process them. SPIN has an event-based communication model: services register event handlers with the SPIN event dispatcher, which supports 1-to-1, many-to-1, and 1-to-many mappings between events and handlers (see the sketch below).
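
A minimal sketch of the register/dispatch idea, with hypothetical names (SPIN’s real dispatcher operates on Modula-3 procedures): one event can fan out to many handlers, and registering the same function with several events gives the many-to-1 case.

```c
#include <stddef.h>

#define MAX_HANDLERS 8

typedef void (*handler_fn)(void *arg);

struct event {
    const char *name;
    handler_fn  handlers[MAX_HANDLERS];  /* 1-to-many: one event, many handlers */
    size_t      n_handlers;
};

/* a service registers its handler with the dispatcher for a given event */
int register_handler(struct event *ev, handler_fn fn)
{
    if (ev->n_handlers == MAX_HANDLERS)
        return -1;
    ev->handlers[ev->n_handlers++] = fn;
    return 0;
}

/* raising the event invokes every registered handler; registering the same
   function with several events yields the many-to-1 mapping */
void dispatch(struct event *ev, void *arg)
{
    for (size_t i = 0; i < ev->n_handlers; i++)
        ev->handlers[i](arg);
}
```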

Default Core Services in SPIN

SPIN provides interface procedures for implementing services in the OS. Think of SPIN as providing header files that can be backed by different Logical Protection Domain implementations. Once an LPD is properly instantiated, it becomes an extension of SPIN, and there is no border crossing between SPIN and the extensions.

Memory Management:

  • Physical Address: Allocate, Deallocate, Reclaim

  • Virtual Address: Allocate, Deallocate

  • Translation: Create/Destroy Address spaces, Add/Remove mapping between virtual pages and physical frames

  • Event Handlers: Page fault, access fault, bad address

CPU Scheduling:

  • The SPIN global scheduler interacts with the threads package of each application (an extension living atop SPIN, which could be an entire OS or just an app). The SPIN global scheduler decides, at a macro level, the amount of time given to a particular extension.

  • SPIN provides an abstraction called a Strand, the unit of scheduling used by SPIN’s global scheduler. The semantics of a strand are entirely defined by the extension.

  • Event handlers: Block, Unblock, Checkpoint, Resume

Conclusion

SPIN introduces a flexible and customizable system where applications can extend kernel functionality by dynamically adding code modules. These extensions allow the system to adapt its behavior to the needs of specific applications, improving performance and flexibility.

This method, however, would be considered unsafe in some instances without the language-based protection mechanisms that SPIN imposes on the OS. SPIN uses Modula-3, a type-safe programming language, to prevent unauthorized or harmful code from compromising the stability or security of the system. This ensures applications can add custom functionality without performing dangerous operations that could corrupt kernel data.

By executing kernel extensions written in Modula-3 within the same address space, SPIN avoids the overhead of traditional context switches and data copying. Fine-grained, efficient event handling and safe access to kernel data structures allow the system to achieve high performance; extensions are triggered by specific system events.

The paper presents experimental results showing that SPIN can achieve performance comparable to or better than a traditional monolithic OS. The authors demonstrate this through various benchmarks, including file system and network performance evaluations.

The SPIN OS offers a novel approach to OS design: a framework that allows safe and efficient extensibility by utilizing language-based safety mechanisms and by keeping extensions in the same hardware address space. This makes SPIN compelling, particularly in scenarios where customization and efficiency are crucial.

Exokernel Approach

The name Exokernel comes from the fact that the kernel exposes hardware explicitly to the operating system extensions living above it. In the Exokernel approach, we decouple the authorization of hardware use from the actual use.

A library OS requests a resource, the request is processed by the Exokernel, and the hardware is exposed through a secure binding: the Exokernel returns an encrypted key to the requesting library operating system. The semantics of how the resource will be used are irrelevant to the Exokernel and are up to the library OS. From then on, the library OS uses the resource by presenting its key to the Exokernel, which validates the key and grants access.

In this way, the Exokernel serves as a sort of doorman between the OS and the hardware (see the sketch below).
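
A toy sketch of the doorman protocol; the key generation, table layout, and function names are invented stand-ins, not the Exokernel’s actual API. The point is the split: authorization happens once at bind time, and later uses only validate the key.

```c
#include <stdint.h>

#define MAX_BINDINGS 64
#define KEY_SECRET   0x5DEECE66DULL   /* stand-in for real key generation */

typedef uint64_t capkey_t;  /* the "encrypted key" given to a library OS */

static struct {
    int      resource;
    capkey_t key;
    int      valid;
} bindings[MAX_BINDINGS];

/* bind: the Exokernel authorizes the request once and hands back a key;
   how the resource will be used is irrelevant to it */
capkey_t exo_bind(int resource)
{
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (!bindings[i].valid) {
            bindings[i].resource = resource;
            bindings[i].key      = ((capkey_t)resource << 32) ^ KEY_SECRET;
            bindings[i].valid    = 1;
            return bindings[i].key;
        }
    }
    return 0;   /* no free binding slots */
}

/* use: the doorman only validates the presented key before granting access */
int exo_access(capkey_t key, int resource)
{
    for (int i = 0; i < MAX_BINDINGS; i++)
        if (bindings[i].valid && bindings[i].key == key &&
            bindings[i].resource == resource)
            return 0;    /* validated: access granted */
    return -1;           /* unknown (or revoked) binding */
}
```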

Examples of Candidate Resources

TLB Entry

  • Virtual-to-physical mapping is done by the library OS.

  • The binding is presented to the Exokernel along with the encrypted key.

  • The Exokernel puts the mapping into the hardware TLB.

  • Once installed, the mapping can be used multiple times without Exokernel intervention.

Packet Filter

  • Needs to be executed by the Exokernel every time a packet arrives, on behalf of the OS.

  • Predicates for incoming packets are loaded into the kernel by the library OS. Loading is heavyweight, but once the binding has been established, packet arrivals do not incur further intervention.

Implementing Secure Bindings

There are three methods for implementing secure bindings.

  • Hardware mechanisms: resources such as a TLB entry, a physical page frame obtained from the Exokernel, or portions of the frame buffer used by the display can be bound by the Exokernel to an OS. Once the OS has an encrypted key for the resource, it can use it any time it wants.

  • Software caching, such as a “shadow” TLB: caching the hardware TLB in a software cache for each OS avoids the context-switch penalty of switching from one OS to another.

  • Downloading code into the kernel, which is functionally equivalent to SPIN extensions (e.g., the packet filter). However, the Exokernel does not have the same protection benefits as SPIN, because the code downloaded into the Exokernel is arbitrary rather than written in a safe language.

Default Core Services in Exokernel

Memory Management (in the example of a Page Fault)

  1. When a thread incurs a page fault, it is first fielded by the Exokernel.

  2. The Exokernel knows which threads belong to which OS, so then it informs the appropriate Library OS through a registered handler.

  3. The OS then services the page fault which may involve requesting a page frame from the Exokernel to host the specific page that is missing.

  4. If this happens, the library OS will ask the Exokernel for a page frame; a binding will be created and a new encrypted key returned.

  5. The library OS then presents the mapping, along with its key, to the Exokernel.

The Exokernel validates the key and, if approved, installs the mapping in the hardware TLB.

Memory Management using S-TLB

On a context switch, one of the biggest sources of performance loss is that locality is lost by the newly scheduled process. When we switch from one library OS to another, the TLB is completely wiped, which is massive overhead.

This is where the Software TLB (S-TLB) comes in. The S-TLB is a data structure in the Exokernel that represents the respective mappings for an OS.

On a TLB flush, the Exokernel dumps the TLB’s contents into the S-TLB of the OS that was previously running. On an OS context switch, it preloads the TLB from the S-TLB of the incoming OS (see the sketch below).
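
A sketch of the dump/preload idea with invented sizes and names (real S-TLB handling is architecture-specific):

```c
#include <string.h>

#define TLB_SIZE  64
#define MAX_LIBOS 8

struct tlb_entry { unsigned long vpn, pfn; int valid; };

static struct tlb_entry hw_tlb[TLB_SIZE];            /* the hardware TLB    */
static struct tlb_entry s_tlb[MAX_LIBOS][TLB_SIZE];  /* one S-TLB per OS    */

void os_context_switch(int from_os, int to_os)
{
    /* dump the outgoing OS's mappings into its S-TLB before the flush */
    memcpy(s_tlb[from_os], hw_tlb, sizeof hw_tlb);

    memset(hw_tlb, 0, sizeof hw_tlb);                /* the TLB flush */

    /* preload the incoming OS's guaranteed mappings so it does not
       restart with a completely cold TLB */
    memcpy(hw_tlb, s_tlb[to_os], sizeof hw_tlb);
}
```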

Default Core Services in Exokernel (cont.)

CPU scheduling maintains a linear vector of time slots. Time is divided into epochs T1 through Tn, each called a time quantum, with a start and end time. These time quanta represent the time allocated to the library OS’ that live atop the Exokernel. Each library OS gets to mark its time quanta in the linear vector of time slots at startup.

If an OS misbehaves, such as taking more time in its time quantum than allowed, the Exokernel will remember and it will take time off from its time quantum the next time it is scheduled. During the time quantum, the OS has complete control of the processor, and the Exokernel will not get involved unless a process that is running on behalf of the OS incurs a page fault.

Revocation of Resources

The Exokernel keeps track of which resources are allocated to which library OS’. At any given time, the Exokernel can revoke resources it has granted to any library OS, by sending a revoke call with a repossession vector detailing the resources being revoked.

The library OS is then responsible for taking corrective action to clean up and finalize tasks with the resources being revoked.

The library OS can also seed the Exokernel with autosave options for resources the Exokernel may want to revoke. For example, if the Exokernel decides to revoke page frames from a library OS, the library OS could have seeded the Exokernel ahead of time so that it dumps those page frames to disk on the OS’ behalf.

Code Usage by Exokernel (Examples)

  • Packet filter for de-multiplexing network packets

  • Run code in Exokernel on behalf of Library OS not currently scheduled (e.g. garbage collection for a specific App)

Putting it all Together

Exokernel Data Structures

The Exokernel maintains a PE data structure on behalf of each library OS. The PE data structure contains the entry points in the library OS for dealing with the different kinds of program discontinuities: exceptions, external interrupts, system calls, and addressing context. Each PE structure is unique to its library OS. This PE setup is functionally similar to the SPIN event handlers.

In addition to the PE data structure, there is also an S-TLB for each OS, holding the “guaranteed mappings.” The entry point associated with addressing context connects to the guaranteed mappings the library OS has requested from the Exokernel: only the TLB entries guaranteed to be kept on behalf of a particular OS are dumped into its S-TLB.

Downloading code into the kernel allows first-level interrupt handling on behalf of a library OS. This is particularly useful when an interrupt arrives for a library OS that is not currently scheduled: the downloaded code can handle it, rather than the currently running OS having to field an interrupt that is not its own.

Performance Results of SPIN and Exokernel

When reading research papers, absolute performance numbers are close to meaningless; you need to compare a system against its contemporaries. Performance questions always center on space (memory) and time (performance).

How much better is X approach over Y approach?

What is the code size difference between X OS vs. Y OS?

L3 Microkernel Approach

The L3 microkernel, by proof of construction, debunks the myths about the assumed low performance of the MKernel-based approach. As a recap, the MKernel approach was deemed low performance because of its constant border crossings, which ensured protection and extensibility at the supposed cost of performance.

It is possible to construct a microkernel like L3 where services are still in their own protection domains but live in the same hardware address space. Therefore, it is all about how efficient the implementation is.

Strikes Against Microkernel

Explicit costs:

  • Kernel-user switches: the cost of border crossing is too high.

  • Address-space switches: cross-protection-domain calls involve flushing the TLB every time.

  • Thread switches and inter-process communication have to be mediated by the MKernel, which involves border crossings at every point.

Implicit cost:

  • Memory effects, such as constant locality loss when moving across address spaces.

Debunking MKernel Myths with L3

L3, by proof of construction, systematically debunks all the myths listed in the “Strikes Against Microkernel” section.

User Kernel Border Crossing

The user-to-kernel border crossing cost is not as high as thought: L3 can perform this border crossing, including TLB and cache misses, in 123 processor cycles. By comparison, CMU’s Mach (to which SPIN and Exokernel both compare themselves) takes around 900 processor cycles.

Address Space Switches

In Address Space switches, the TLB is used to translate the virtual page number into a physical frame number.

On a context switch, the TLB is flushed, which is costly when switching address spaces often. However, a flush is not always necessary: the TLB may store address-space tags, so the entire TLB need not be flushed, depending on the situation.

In an architecture that supports address space tags in the TLB:

  1. When an entry is made in the TLB, the process ID (PID) is stored with the entry as an address-space tag.

  2. On address translation, the virtual page number (VPN) is split into tag and index (like normal), and the entry’s address-space tag is compared against the current PID.

  3. A hit requires both the VPN tag and the PID to match the entry.

  4. Because entries from different address spaces can coexist, there is no need to flush the TLB on a context switch (see the sketch after this list).
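
A sketch of a direct-mapped, address-space-tagged TLB lookup (a hypothetical layout for illustration; real TLBs are associative):

```c
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint64_t tag;     /* high bits of the VPN */
    uint32_t pid;     /* address-space tag: the owning process */
    uint64_t pfn;
    int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* returns the physical frame number on a hit, -1 on a miss; entries that
   belong to other address spaces simply miss instead of being flushed */
int64_t tlb_lookup(uint64_t vpn, uint32_t current_pid)
{
    uint64_t index = vpn % TLB_ENTRIES;   /* index bits of the VPN */
    uint64_t tag   = vpn / TLB_ENTRIES;   /* remaining bits form the tag */
    const struct tlb_entry *e = &tlb[index];

    if (e->valid && e->tag == tag && e->pid == current_pid)
        return (int64_t)e->pfn;           /* hit for this address space */
    return -1;                            /* miss: walk the page table */
}
```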

Liedtke, the author of the L3 Microkernel approach, suggests taking advantage of whatever the architecture offers you in regards to avoiding TLB flush. For example:

In x86 and PowerPC, segment registers are offered. Segment registers give an OS the opportunity to specify a range of virtual addresses that can be legally accessed by the currently running process. We can use segment registers to create a protection domain to bound addresses for a running process. The segment bounds are hardware enforced.

What if a protection domain is so large that it needs all of the Hardware address space?

If a protection domain is so large that it occupies the entire hardware address space, and the TLB does not support address-space tagging, then you have to do a TLB flush on a context switch.

In this case, the explicit costs we’ve been discussing become insignificant compared to the much larger implicit cost that is incurred: the cache effects of context switching between processes that each occupy the entire hardware address space are significant.

If we are switching in Small Protection Domains:

  • By taking advantage of architecture/hardware options given to you, careful construction of processes can render address space switching efficient.

If we are switching in Large Protection Domains:

  • Switching cost is insignificant relative to the implicit costs

  • The cache and TLB effects dominate and are a much more significant implicit cost.

Thread Switches and IPC

The explicit cost of a thread switch, in which the microkernel starts running another thread, involves saving all the volatile state of the processor. By proof of construction, L3 shows this cost to be competitive with the SPIN and Exokernel approaches.

Memory effects

Memory effects can be mitigated by carefully structuring the protection domains in the hardware address space. When the protection domains are small, Liedtke suggests packing them all into the same hardware address space and enforcing protection between processes using segment registers. In this way, the caches stay warm even across context switches.

In debunking the myths regarding address space switches, we also have strategies to mitigate memory effect issues as well.

Why was border crossing in Mach so expensive?

Mach’s significant focus on portability made it architecture independent, allowing the Mach microkernel to run on several different processor architectures. This meant code bloat in Mach, because it had to be compatible with every architecture it aimed to run on. Specifically, Mach contains architecture-independent and architecture-specific parts of the microkernel.

As a result, Mach has a large memory footprint, hence less locality, hence more cache misses, and therefore longer latency for border crossings.

Portability and performance contradict each other.

Thesis of L3 on OS Structuring

The principles of L3 are as follows:

  • The microkernel should provide only minimal abstractions: support for threads, address spaces, inter-process communication, and generating unique IDs. We need these four because ANY subsystem in an OS relies on them.

  • Microkernels are processor-specific in implementation. Avoid portability if you want an efficient implementation of a microkernel. You need to exploit anything a certain architecture offers you.

With the right set of microkernel abstractions and a processor-specific implementation, you can build efficient, processor-independent abstractions at higher layers.

Papers

Brian Bershad et al., "Extensibility, Safety and Performance in the SPIN Operating System", Proceedings of the 15th ACM Symposium on Operating System Principles, December 1995.

Dawson R. Engler, Frans Kaashoek and James O'Toole, "Exokernel: An Operating System Architecture for Application-Level Resource Management", Proceedings of the 15th ACM Symposium on Operating System Principles, ACM, December 1995.

J. Liedtke, "On Micro-Kernel Construction", Proceedings of the 15th ACM Symposium on Operating System Principles, ACM, December 1995.

J. Liedtke, "Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces", GMD Technical Report No. 933, November 1995 (self-study).

Virtualization

Introduction to Virtualization

Platform Virtualization

Platform virtualization treats the system as a black box running atop a virtual platform. The end user is satisfied as long as the virtual platform provides the functionality the system appears to have.

Utility Computing

Multiple apps and OS’ can run atop shared hardware resources. Running different apps on shared hardware lowers overall server cost, since their resource demands usually do not peak at the same time.

Utility computing is promoted by data centers worldwide; it lets end users access extra computing resources during their peak times instead of purchasing their own hardware.

Hypervisors

How is it possible that multiple different OS’ can run atop the same shared hardware resources? Who decides who gets what? A hypervisor is like an OS of OS’.

There are native (bare-metal) hypervisors, which run directly atop the shared hardware, with the OS’ running atop the hypervisor.

The second type is the hosted hypervisor, which runs atop a host OS and allows the user to emulate the functionality of other OS’.

Connecting the Dots

  • The concept of virtualization started in the 70s with IBM’s VM/370. The intent was to give every user on a computer the illusion of owning the computer.

  • Microkernels were introduced in the late 80’s and 90’s

  • Extensibility of OS was investigated in the 90’s

  • SimOS was introduced in the late 90’s; it laid the basis for the modern resurgence of virtualization technology and was the basis for VMware.

  • Xen and VMware were based on papers laid out in the early 2000’s.

Now there is accountability for usage in data centers: companies can provide cloud virtualization services and bill users individually. Virtualization has made computing another utility, like water or electricity.

Full Virtualization

The full virtualization framework leaves the OS basically untouched, allowing you to run the unchanged binary of the OS atop the hypervisor. The OS’ run as user-level processes atop the hypervisor, which means they do not get the privileges they would have running directly on bare metal.

When the OS runs privileged instructions, those instructions trap into the hypervisor, which emulates the intended functionality on behalf of the OS.

In some architectures, privileged instructions executed at user level fail silently. The hypervisor must watch for these cases and emulate the intended behavior when such failures happen. Early on this was a significant problem, but modern architectures support virtualization natively much better.

Para Virtualization

Another approach to virtualization is to modify the source code of the guest OS. Doing this, we can avoid problematic instructions, and include optimizations between the OS and the underlying hardware.

With para virtualization, less than 2% of the guest OS code needs modification. This was proven in Xen by proof of construction.

Big Picture

Virtualize Hardware:

  • Memory Hierarchy

  • CPU

  • Devices

Effect data transfer and control transfer between the guests and the hypervisor

Memory Virtualization

Memory Hierarchy

Caches are physically tagged, so not much work is needed to virtualize them. The main struggle is virtual memory: mapping virtual addresses to physical addresses is a key functionality of the memory subsystem.

Memory Subsystem Recall

Each process is in its own protection domain, and usually, a separate hardware address space. The OS maintains a page table on behalf of each of the processes. The page table is an OS data structure that holds the mapping between the virtual page numbers and the physical pages where those virtual pages are contained in the main memory of the hardware.

Physical memory is contiguous, but a process’s virtual addresses do not point to contiguous sections of physical memory; the locations are scattered across physical memory.

Memory Management and Hypervisor

In the virtualized setup, the hypervisor sits between the guest OS and the hardware. Each process in each OS is in its own protection domain, which means each process has a distinct page table in its OS. The hypervisor does not know about these page tables stored in the OS’.

Memory Manager Zoomed Out

The real memory, called machine memory, is in the control of the hypervisor. What the guest OS sees as its “physical memory” is an illusion created by the hypervisor: it is not actually contiguous machine memory; instead, the hypervisor scatters the OS’s physical-memory regions across machine memory. The needs of an OS may change through its lifespan, and the hypervisor may need to allot more machine memory to an OS in a non-contiguous manner.

Zooming Back In

In a non-virtualized setting, the virtual page number goes through the page table and results with a physical page number.

Non-Virtualized Setting:

VPN → PT → PPN

In a virtualized setting, however, there is another level of indirection. The physical page number of the OS has to be mapped to the machine memory / page numbers (MPN). The mapping between the physical page number (PPN) and the machine page number (MPN) is kept in a shadow page table (SPT).

Virtualized Setting:

VPN → PT → PPN → S-PT → MPN

In the case of full virtualization, the guest OS does not know it is not running on bare metal, and therefore the hypervisor stores the PPN → MPN mappings.

In the case of para virtualization, the guest OS is aware it is being virtualized, and it can and usually does store the PPN → MPN mappings instead of the hypervisor.
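
The extra level of indirection as a tiny sketch, using single-level tables for illustration. In reality page tables are multi-level, and the hypervisor installs the composed VPN → MPN translation into the hardware so the two-step walk is not paid on every access:

```c
#include <stdint.h>

#define N_PAGES 1024

static uint64_t guest_pt[N_PAGES];   /* guest's page table: VPN -> PPN */
static uint64_t shadow_pt[N_PAGES];  /* hypervisor's table: PPN -> MPN */

/* the virtualized translation path: VPN -> PT -> PPN -> S-PT -> MPN */
uint64_t translate(uint64_t vpn)
{
    uint64_t ppn = guest_pt[vpn];    /* what the guest believes is physical */
    uint64_t mpn = shadow_pt[ppn];   /* where the page really lives */
    return mpn;
}
```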

Shadow Page Table

In many architectures, the CPU uses the page table for address translation. Under virtualization, the hardware page table that the CPU consults is really the shadow page table.

Efficient Mapping (Full Virtualization)

When a fully virtualized guest OS attempts to map a VPN to a PPN through its page table, that is a privileged action, which traps into the hypervisor. The hypervisor then maps the particular VPN to an entry in the S-PT, and the translation is installed into the TLB / hardware page table.

Efficient Mapping (Para Virtualization)

In a para virtualized setting, the guest OS is aware that the physical memory is non-contiguous. The burden of efficient mapping can be moved from the hypervisor to the guest OS.

In Xen (a paravirtualized hypervisor), a set of hypercalls is provided for the guest OS to tell the hypervisor about the changes made to the hardware page table.

The guest OS makes hypercalls to the hypervisor, allowing it, for example, to initialize a page frame and then ask the hypervisor to create a page table on its behalf with that initialized page frame. Other supported operations include switching between page tables and updating a page table.

Dynamically Increasing Memory

The hypervisor must be ready to handle the varying memory needs of the guest OS’ by allocating them regions of machine memory. If all the machine memory is taken up, the hypervisor can take back a portion of machine memory from another OS. Rather than forcibly reclaiming, the hypervisor can also request memory from an OS so that it voluntarily gives some up to its peers.

Ballooning

Ballooning is a technique in which a special device driver is installed in every guest OS, regardless of full or para virtualization. The balloon device driver is the key to managing memory pressure in a virtualized setting.

If the hypervisor needs more memory, it contacts a guest OS’ balloon driver through a private channel between the balloon driver and the hypervisor. The hypervisor instructs the balloon to inflate inside the guest OS, so the balloon requests more and more memory from the guest. Once the balloon has acquired the memory, it returns the corresponding physical memory to the hypervisor.

If the hypervisor needs less memory, it can tell the guest OS’ balloon driver to deflate; deflating releases memory back into the guest OS (see the sketch below).
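
A sketch of the balloon driver’s two operations, with stubbed-out guest and hypervisor hooks (the real private channel and frame hand-off are hypervisor-specific):

```c
#include <stdlib.h>

#define MAX_BALLOON_PAGES 4096

/* stand-ins for the guest kernel allocator and the private hypervisor channel */
static void *guest_alloc_page(void)              { return malloc(4096); }
static void  guest_free_page(void *p)            { free(p); }
static void  give_frame_to_hypervisor(void *p)   { (void)p; }
static void  take_frame_from_hypervisor(void *p) { (void)p; }

static void  *balloon_pages[MAX_BALLOON_PAGES];
static size_t balloon_size;

/* inflate: squeeze memory out of the guest and return the frames to the
   hypervisor for use by its peers */
void balloon_inflate(size_t n_pages)
{
    while (n_pages-- && balloon_size < MAX_BALLOON_PAGES) {
        void *page = guest_alloc_page();  /* guest decides what to give up */
        if (!page)
            break;
        balloon_pages[balloon_size++] = page;
        give_frame_to_hypervisor(page);
    }
}

/* deflate: reclaim frames from the hypervisor and release the pages
   back into the guest OS */
void balloon_deflate(size_t n_pages)
{
    while (n_pages-- && balloon_size > 0) {
        void *page = balloon_pages[--balloon_size];
        take_frame_from_hypervisor(page);
        guest_free_page(page);
    }
}
```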

Sharing Memory Across Virtual Machines

It is possible, depending on the situation, to share space in the machine memory. For example, if two VMs are running the same version of Linux with the same version of Firefox, then to avoid storing duplicate data, virtual pages of both VMs can be mapped to the same place in machine memory.

One way to do this is with cooperation between the VMs and the hypervisor, where hooks let them inform each other of the situation.

VM Oblivious Page Sharing

An alternative is for the VMs to be completely oblivious to the page sharing. The hypervisor keeps a hash table containing content hashes of the machine pages.

  1. Take the contents of a page and create a content hash representing that page, corresponding to the PPN of the specific memory in the VM.

  2. Look up the hash in the hypervisor’s hash table to see whether it matches any page currently in machine memory.

  3. A match is only a hint that the stored data is a duplicate: the VM owning the matched page could have modified it since it was hashed.

  4. So a full comparison must be made to verify that the data really is a duplicate.

  5. On a successful match, map the VM’s page to the same spot in machine memory, and increment a reference count recording how many mappings point to that machine page.

  6. Mark the entries copy-on-write: the VMs share the page as long as they only read it, but if any of them writes to it, a private copy of the original page is made and the mappings are adjusted.

  7. After this successful match process, the now-duplicate page frame can be freed in machine memory. (A sketch of this flow follows.)
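
A sketch of the hint, full compare, then share flow, using an invented table layout (VMware ESX’s real implementation differs in the details):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096
#define TABLE_SIZE 1024

struct shared_page {
    uint64_t hash;    /* content hash: only a hint */
    uint8_t *mpn;     /* machine page holding the contents */
    int      refs;    /* how many PPNs map here */
};

static struct shared_page table[TABLE_SIZE];

static uint64_t content_hash(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;  /* FNV-1a, as a stand-in */
    for (int i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 1099511628211ULL;
    return h;
}

/* returns the machine page to share (caller marks the mapping copy-on-write
   and frees the duplicate frame), or NULL if no duplicate exists */
uint8_t *try_share(uint8_t *candidate)
{
    uint64_t h = content_hash(candidate);

    for (int i = 0; i < TABLE_SIZE; i++) {   /* linear scan: a real hash table
                                                would index by h */
        if (table[i].refs > 0 && table[i].hash == h) {
            /* hash match is only a hint: do the full comparison */
            if (memcmp(table[i].mpn, candidate, PAGE_SIZE) == 0) {
                table[i].refs++;             /* one more mapping to this frame */
                return table[i].mpn;
            }
        }
    }

    /* no duplicate: record this page as a new hint for future matches */
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].refs == 0) {
            table[i].hash = h;
            table[i].mpn  = candidate;
            table[i].refs = 1;
            break;
        }
    }
    return NULL;
}
```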

Memory Allocation Policies

Different policies are in place for allocating and reclaiming memory. The goal of virtualization is maximization of the utilization of resources. Virtualized environments use different policies for memory allocation, such as:

  • Pure share-based approach: you pay less, you get less. The data center gives you an amount of resources directly correlated with the money you pay. This can lead to hoarding, where a VM acquires a lot of resources and holds onto them without using them.

  • Working-set-based approach: if a VM’s working set needs more memory, give it more; if it needs less, give it less. However, a customer who paid for a certain amount of resources may feel entitled to always get those resources.

  • Dynamic idle-adjusted shares approach: mixes both approaches by taxing idle pages more than active pages. If the resources you’ve been given are actively used, more power to you; but if you are just holding them, they are taken away, or “taxed.” Different data centers can choose different tax rates, with higher rates promoting a more “wealth-redistribution” approach and lower rates corresponding to a more pay-for-what-you-get approach.

    With a tax rate somewhere in the middle, most idle memory can be reclaimed, but because it is not taxed at 100%, sudden working-set increases can still be accommodated (see the sketch below).
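
The ESX paper (referenced below) formalizes this with an idle-adjusted shares-per-page ratio: a VM with S shares and P pages, of which fraction f are active, gets ratio ρ = S / (P(f + k(1 − f))), where k = 1/(1 − τ) makes an idle page cost k times an active one; memory is reclaimed from the VM with the lowest ratio. A one-function sketch:

```c
/* idle-adjusted shares-per-page ratio (Waldspurger, OSDI 2002):
   rho = S / (P * (f + k*(1 - f))), with k = 1/(1 - tau).
   tau = 0 is the pure share-based policy; tau -> 1 approaches the
   working-set-based policy. */
double adjusted_ratio(double shares, double pages,
                      double active_fraction, double tax_rate)
{
    double k = 1.0 / (1.0 - tax_rate);  /* an idle page costs k times more */
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)));
}
```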

CPU & Device Virtualization

We aim for the guest OS on a hypervisor to have the illusion of owning the CPU. We also want the hypervisor to field events arising due to the execution of a process that belongs to a parent guest OS.

CPU Virtualization

In each guest OS, processes are already multiplexed by the OS, even in a non-virtualized setting. The hypervisor has to give each guest OS the illusion of owning the CPU, and it needs a precise way to account for the time each guest OS is allocated on the CPU.

The hypervisor may give the guest OS’ proportional shares, depending on each VM’s service agreement with the hypervisor. There is also fair-share scheduling, which gives a fair (equal) share to each guest OS.

In either case, the hypervisor must account for time stolen from a guest OS’ share by another guest OS, as happens with external interrupts. If time is taken away during a guest OS’ slot, it is rewarded back in the future.

Events arising from a process currently executing on the CPU need to be delivered to the parent guest OS, in both the full and para virtualized settings. A process may incur a system call, page fault, exception, or external interrupt. All of these discontinuities have to be passed up by the hypervisor to the respective guest OS, delivered as software interrupts.

Some of these discontinuities require privileged access to handle in the guest OS. In a fully virtualized environment, the guest OS may attempt a privileged action, get trapped by the hypervisor, and fail silently; the hypervisor needs to account for such quirks in order to handle them.

Device Virtualization

For full virtualization, the OS’ already believe they have full ownership of the devices. In actuality, the hypervisor uses the trap-and-emulate approach.

Para virtualization has much more interesting opportunities for innovation. It is possible for the hypervisor to create clean device abstractions for the guest OS. Shared buffers can also be exposed for the guest OS. There can also be innovations in the communication methods themselves, between the guest OS and the hypervisor.

Control Transfer

For full virtualization, control transfer happens implicitly: the guest OS attempts a privileged instruction and the hypervisor traps it, while software interrupts (events) flow from the hypervisor to the guest OS.

For para virtualization, control transfer happens explicitly through hypercalls from the guest into the hypervisor, and through software interrupts (events) from the hypervisor to the guest OS. The guest controls, via hypercalls, when event notifications need to be delivered.

Data Transfer

For full virtualization, data transfer is implicit.

For para virtualization (e.g., Xen), data transfer is explicit, offering opportunities to innovate.

Xen provides asynchronous I/O rings: a data structure shared between the guest and Xen for communication. Any number of rings can be allocated to handle all the device I/O needs of a particular guest domain. The ring itself holds descriptors. The guest OS places requests in the I/O ring by populating descriptors; every descriptor is a unique I/O request with a unique ID. When Xen finishes processing a request, it places the response in the same ring, in a descriptor carrying the same unique ID.

This is a producer-consumer relationship: the guest is the request producer and writes to a shared pointer visible to Xen, the request consumer. For responses the relationship flips: Xen is the response producer writing to a shared pointer visible to the guest, the response consumer (see the sketch below).
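
A simplified sketch of one ring. Real Xen rings share a single descriptor array for requests and responses and pair it with event-channel notifications; the names and layout here are illustrative:

```c
#include <stdint.h>

#define RING_SIZE 32   /* power of two so the modulo indexing stays cheap */

struct descriptor {
    uint64_t id;       /* unique request ID; the response reuses it */
    void    *buffer;   /* pointer to a guest-owned buffer: no copying */
    int      status;   /* filled in by the response */
};

struct io_ring {
    struct descriptor ring[RING_SIZE];
    /* guest is the request producer, Xen the request consumer */
    uint32_t req_prod, req_cons;
    /* for responses, the roles flip */
    uint32_t rsp_prod, rsp_cons;
};

/* guest side: enqueue a request by filling in the next free descriptor */
int submit(struct io_ring *r, uint64_t id, void *guest_buffer)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                            /* ring full */
    struct descriptor *d = &r->ring[r->req_prod % RING_SIZE];
    d->id = id;
    d->buffer = guest_buffer;                 /* a pointer, not a copy */
    r->req_prod++;                            /* advance the shared pointer */
    return 0;
}

/* Xen side: consume one request, process it, produce the response */
void service_one(struct io_ring *r)
{
    if (r->req_cons == r->req_prod)
        return;                               /* nothing pending */
    struct descriptor *d = &r->ring[r->req_cons % RING_SIZE];
    r->req_cons++;
    d->status = 0;                            /* "processed": same slot, same ID */
    r->rsp_prod++;                            /* guest will consume the response */
}
```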

Control and Data Transfer in Action

For Xen Data Transfer in action, we’ll use Network Virtualization as an example.

Each guest has 2 I/O rings. One is used for transmission, the other is for reception. In the case of transmission:

If the guest wants to transmit packets, it will enqueue descriptors in the transmit I/O ring via hypercalls provided by Xen.

The packets to be transmitted are not copied into Xen; the buffers containing the packets remain guest OS buffers.

Pointers to those buffers are embedded in the descriptors the guest enqueues for transmission.

For the duration of the transmission, the pages associated with the network packets are page pinned.

A round robin packet scheduler is used by Xen in order to transmit packets from different VMs.

In the case of reception:

When Xen receives a network packet destined for a particular guest OS, it exchanges the received packet for one of the guest OS pages already provided to Xen as a holding place for incoming packets.

A guest OS will pre-allocate network buffers, which are pages owned by a guest OS.

Disk I/O Virtualization

Disk I/O virtualization works quite similarly to network virtualization. Every VM has an I/O ring dedicated to disk I/O. The communication strives to avoid copying into Xen: by enqueueing pointers to pre-allocated guest OS buffers, copying is avoided.

Xen may reorder requests from guest domains to make I/O throughput efficient. In situations where such reordering is inappropriate given the semantics of the I/O operations, Xen provides a reorder barrier that lets a guest OS enforce that operations are performed in the order they were issued.

Measuring Time

Virtualized environments keep accurate measures of space and time usage for CPU, memory, storage, and network, so that providers can bill their clients appropriately.

Xen and Guests

Papers

Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield, "Xen and the Art of Virtualization", SOSP 2003.

Carl Waldspurger, "Memory Resource Management in VMware ESX Server", OSDI 2002.

Parallel Computing

Shared Memory Machines

All three shared-memory machine models (dance hall, SMP, DSM) have CPUs, an interconnection network, and memory.

In dance hall architecture, you have CPUs on the opposite side of memory that are connected via an interconnection network.

In SMP (symmetric multiprocessing), each CPU takes the same amount of time to access memory.

In DSM (distributed shared memory), each CPU has its own memory, and can still access every other CPU’s memory through the interconnection network.

Shared Memory and Caches

When a CPU wants to access a memory location, it first checks its cache; if the location is not there, it fetches it from main memory and puts it into the cache.

In a multiprocessor, it works the same way. However, if a value stored in more than one cache is changed, how do we update it across all the caches? This is the cache coherence problem. Who should ensure this consistency?

This requires an agreement between hardware and software, called a memory consistency model.

Memory Consistency Model

Across two different processes, there is no guarantee about when the steps of one run relative to the steps of the other.

The sequential consistency model maintains program order within each process plus arbitrary interleaving across processes, like a card dealer’s merge shuffle in a casino.

Hardware Cache Coherence

Suppose a processor writes to a memory location that is also present in other processors’ caches. In that case, the hardware will invalidate all the other cached copies of that location. This is called write-invalidate.

In the write-update scheme, the hardware instead updates the other cached copies rather than invalidating them.

The goal is to keep coherence overhead to a minimum; that overhead continues to grow as more CPUs are added and the system grows.

Scalability

There is overhead that increases as more processors are added.

To optimize performance, minimize sharing of memory across threads as much as possible.

Synchronization

Summary

A lock protects data so that operations on it can only be performed by one particular process at a time. An exclusive lock ensures that only one process can access the data at a time; a shared lock allows simultaneous access, for example when multiple readers read the same data.

Synchronization Primitives

Barrier synchronization ensures that all threads need to catch up to the barrier before they all continue.

Atomic Operations

We need read-modify-write (RMW) semantics for atomic instructions, e.g., test-and-set or fetch-and-increment.

Scalability Issues with Synchronization

  1. Latency: how long it takes a thread to acquire the lock when the lock is free.

  2. Waiting Time: If a thread wants to acquire a lock, how long does it need to wait before it has its turn with the data?

  3. Contention: when a thread releases a lock, there may be many threads waiting that must contend for access to the lock. How long does this contention take?

Naive Spinlock

The problem with the naive spinlock (spinning on test-and-set) is too much contention: every waiter hammers the shared lock variable with atomic operations, which does not exploit caches and disrupts the lock holder’s useful work.

Caching Spinlock

Here we assume that the hardware ensures cache coherence through the system bus.

The CPUs spin locally on the cached copy of the lock while waiting; when the lock is released, each waiter attempts the atomic acquire, and whoever loses goes back to spinning locally (see the sketch below).

With high contention, it can take O(n²) bus transactions, where n is the number of processors contending for the lock at a given time.
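
A C11 sketch of both flavors: the naive test-and-set lock and the spin-on-read (test-and-test-and-set) variant.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;

/* naive spinlock: every spin iteration is an atomic test-and-set, which
   generates bus traffic even while the lock is held */
void naive_lock(void)
{
    while (atomic_exchange(&lock_flag, true))
        ;
}

/* caching ("spin on read") spinlock: spin locally in the cache, and only
   attempt the atomic RMW when the lock looks free */
void caching_lock(void)
{
    for (;;) {
        while (atomic_load(&lock_flag))
            ;                               /* read the cached copy */
        if (!atomic_exchange(&lock_flag, true))
            return;                         /* won the race */
        /* lost the race: someone else grabbed it, resume local spinning */
    }
}

void unlock(void)
{
    atomic_store(&lock_flag, false);
}
```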

Spinlocks with Delay

Each processor waits for a delay that increases dynamically every time it checks the lock and finds it still held. With enough delay, this works even without spinning on a cached copy (see the sketch below).
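
The same idea with an exponential delay added (the delay constants are arbitrary):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;

/* spinlock with delay: after each failed attempt, wait before re-checking,
   and increase the delay dynamically so contenders spread out in time */
void backoff_lock(void)
{
    unsigned delay = 1;
    while (atomic_exchange(&lock_flag, true)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                      /* no memory traffic during the delay */
        if (delay < 1024)
            delay *= 2;            /* back off further on every failed check */
    }
}

void backoff_unlock(void)
{
    atomic_store(&lock_flag, false);
}
```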

Ticket Lock

With the ticket lock model, we can ensure fairness by using a queue approach. For example, at a popular deli you take a numbered ticket, so you know how many people are ahead of you and when it will be your turn to be served.

This achieves fairness but causes extra contention on the network, since every waiter must observe the now-serving counter as it increases (see the sketch below).
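
A C11 sketch of the deli analogy:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* the ticket dispenser */
    atomic_uint now_serving;   /* the display above the counter */
} ticket_lock;

void ticket_acquire(ticket_lock *l)
{
    /* one atomic RMW to take a ticket; arrival order decides lock order */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;   /* every release updates now_serving, disturbing all waiters */
}

void ticket_release(ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}
```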

Spinlock Summary

  1. Read with T+S: No fairness

  2. T+S with Delay: No fairness

  3. Ticket Lock: Fair but noisy

Array-based Queueing Lock

In this style, we get a fair and quiet (low-contention) queue for the lock. The only real downside is higher space complexity: O(N) space per lock, where N is the number of processors (see the sketch below).
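
A sketch of Anderson’s array-based queueing lock. The padding keeps each waiter spinning on its own cache line, and the O(N) space cost per lock is visible in the slots array:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N_CPUS 64

typedef struct {
    /* one flag per slot, padded to a cache line to avoid false sharing */
    struct {
        atomic_bool has_lock;
        char        pad[64 - sizeof(atomic_bool)];
    } slots[N_CPUS];
    atomic_uint next_slot;
} array_lock;

void array_lock_init(array_lock *l)
{
    for (int i = 0; i < N_CPUS; i++)
        atomic_store(&l->slots[i].has_lock, false);
    atomic_store(&l->slots[0].has_lock, true);   /* first arrival gets it */
    atomic_store(&l->next_slot, 0);
}

/* returns the slot index, which the caller passes back to release() */
unsigned array_acquire(array_lock *l)
{
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % N_CPUS;
    while (!atomic_load(&l->slots[me].has_lock))
        ;   /* each waiter spins on its own cache line only */
    return me;
}

void array_release(array_lock *l, unsigned me)
{
    atomic_store(&l->slots[me].has_lock, false);              /* reset mine */
    atomic_store(&l->slots[(me + 1) % N_CPUS].has_lock, true); /* pass it on */
}
```

This stays fair (slots are granted in arrival order) and quiet (a release touches exactly one waiter’s cache line), at the cost of N_CPUS slots per lock.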

Linked Based Queueing Lock

Anderson’s array-based queueing lock tends to be slightly faster, since linked lists carry more overhead; the linked variant trades that overhead for space proportional to the actual number of waiters rather than a fixed array (see the sketch below).
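
The linked-list variant is the MCS lock (Mellor-Crummey and Scott). A C11 sketch, in which each waiter spins on its own queue node instead of a statically sized array:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
};

typedef struct mcs_node *_Atomic mcs_lock_t;   /* points at the queue tail */

void mcs_acquire(mcs_lock_t *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    /* swap ourselves in as the new tail of the queue */
    struct mcs_node *prev = atomic_exchange(lock, me);
    if (prev != NULL) {                    /* lock is held: queue up */
        atomic_store(&me->locked, true);
        atomic_store(&prev->next, me);     /* link in behind the predecessor */
        while (atomic_load(&me->locked))
            ;                              /* spin on our own node only */
    }
}

void mcs_release(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct mcs_node *expected = me;
        /* no visible successor: if we are still the tail, free the lock */
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;
        /* someone swapped in but has not linked yet: wait for the link */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);    /* hand the lock to the successor */
}
```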

Summary

Communication

Barrier Synchronization

All threads have to wait at the barrier until every thread has reached it.

This has a problem, though. Before the last processor resets the count back to N, other processors may race ahead to the next barrier and pass through it.

We need to add another condition so that the count is properly reset before anyone proceeds. But that requires two spin loops, which is not optimal; we want one spin loop.

Sense Reversing Barrier

The sense-reversing barrier gets down to one spin loop, but there is still too much sharing in this approach: every processor spins on the single, globally shared sense flag (see the sketch below).
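
A C11 sketch of a sense-reversing barrier: one count plus a globally shared sense flag gives a single spin loop per barrier episode.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N_THREADS 8

static atomic_int  count        = N_THREADS;
static atomic_bool global_sense = true;

/* each thread keeps its own local sense flag, initialized to true */
void barrier(bool *local_sense)
{
    *local_sense = !*local_sense;              /* the sense for this episode */
    if (atomic_fetch_sub(&count, 1) == 1) {    /* last thread to arrive */
        atomic_store(&count, N_THREADS);       /* reset the count first... */
        atomic_store(&global_sense, *local_sense);  /* ...then release all */
    } else {
        while (atomic_load(&global_sense) != *local_sense)
            ;                                  /* the single spin loop */
    }
}
```

Resetting the count before flipping the sense is safe because no thread can pass the spin loop, reach the next barrier, and touch the count until the sense actually flips.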

Tree Barrier

A more scalable version is the tree barrier, which applies divide and conquer: processors synchronize in small groups, and the groups are combined up a tree.