
Nowadays, many-core platforms are the de facto reference architectures for current and future processors, mainly because of physical limitations that no longer allow improving the sequential performance of an individual CPU core.
At the same time, this scaled-up hardware parallelism has given rise to innovative coordination mechanisms that go beyond the classical ones based on lock-protected critical sections.
In particular, non-blocking coordination algorithms let threads advance their computation regardless of whether other threads are delayed or faulty, thus providing markedly improved scalability in the management of shared data under high contention.
However, results in the literature have demonstrated that, for a given coordination problem, non-blocking algorithms solving it do not provide a one-size-fits-all solution, e.g., because they may induce retries of specific algorithmic steps or because of their intrinsic complexity.
In this context, our research objective is to design and implement non-blocking coordination algorithms able to adapt their run-time behavior to the workload profile, the variable intensity of shared-data access contention, and the peculiarities of the underlying hardware platform (such as the organization of the memory hierarchy).
The experimental assessment of our achievements will be carried out along two main directions: (i) by using several kinds of concurrent applications as test-beds (HPC instances, OSes, hypervisors, soft real-time systems); (ii) by testing our solutions on different platforms (multi-core, many-core, GPU, cloud-based).
During our research, we will develop techniques intended to stand as best practices for designing efficient non-blocking solutions according to an adaptive paradigm, together with formal methodologies to prove their correctness.
Consequently, studying the trade-off between progress, correctness, and scalability will deepen the knowledge of concurrent algorithms, producing a set of theoretical and empirical tools that will help developers choose and optimize synchronization mechanisms.
Moreover, we will devise new algorithms targeting particular use-case problems (such as memory allocators and event pool management), which will represent steps towards the autonomic fine-tuning of synchronization protocols and, at the same time, may be used as solutions in any parallel program that must remain scalable in the face of varying workloads.
We will distribute our solutions as open-source libraries, paired with documentation and instructions on how to compile and use them.
This will enable any user to exploit the achieved results and will foster further research, since our outcomes can be reproduced and analyzed by the scientific community. Also, we will rely only on hardware capabilities offered by off-the-shelf processors, thus making our results applicable to the largest variety of platforms.
We also note that the open-source release of our solutions, and their tailoring for off-the-shelf platforms, will allow developers who are not computer scientists to use the outcomes of our research and to increase the scalability of their parallel software independently of the specific application domain, since improvements are achieved at the level of basic coordination tasks.
Aside from improving the knowledge of non-blocking algorithms, the adaptive synchronization paradigm resulting from our research will further contribute to the understanding of shared memory systems, impacting the efficiency of parallel applications and how software is designed and developed.
In fact, a software engineer can design multiple solutions for a synchronization problem that are then combined automatically, without releasing multiple versions of the same application targeting different levels of parallelism in the underlying hardware architectures.
This can reduce the cost of software development in terms of the time required to ensure high performance and scalability on different types of architectures. Furthermore, it can markedly reduce the time required for software correctness verification, since an adaptive combination of, e.g., linearizable coordination algorithms automatically guarantees linearizability.
Moreover, any software exploiting our solutions can benefit from improved scalability with the number of computing cores, either by solving a given problem faster (strong scalability) or by solving a larger problem in the same time (weak scalability).
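Using the standard definitions (with $T(p,n)$ the execution time on $p$ cores for problem size $n$), the two notions read:

```latex
% Strong scalability: fixed problem size n, growing core count p
S_{\text{strong}}(p) = \frac{T(1,n)}{T(p,n)}
% Weak scalability: problem size grows proportionally with p
E_{\text{weak}}(p) = \frac{T(1,n)}{T(p,\,p \cdot n)}
```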
Additionally, at a fixed level of hardware parallelism, software throughput can be increased, inherently improving energy efficiency thanks to the reduction of clock cycles spent running steps of the synchronization protocol.
Such energy savings will also lead to increased profit margins when delivering services hosted by parallel hardware, such as cloud infrastructures.
At the same time, applications running on cloud services can achieve savings thanks to the improved scalability, which reduces the on-line time of rented virtual nodes; further, the adaptivity of synchronization protocols makes it easy to exploit new virtual cores added on demand.
As an example, in [10] we show that our non-blocking pending event set makes a parallel discrete event simulator run 5 times faster than a blocking calendar queue. If we had to pay for the computing time used, we would thus spend one fifth of the amount required by the classical approach.
Conversely, a "simulation as a service" provider could serve five customers in the time a single execution based on a blocking pending event set would take.