Restoring the Broken Covenant Between Compilers and Deep Learning
  Accelerators

By: Sean Kinzer, Soroush Ghodrati, Rohan Mahapatra, Byung Hoon Ahn, Edwin Mascarenhas, Xiaolong Li, Janarbek Matai, Liang Zhang, Hadi Esmaeilzadeh

Deep learning accelerators address the computational demands of Deep Neural Networks (DNNs), departing from the traditional Von Neumann execution model. They leverage specialized hardware to align with the application domain's structure. Compilers for these accelerators face distinct challenges compared to those for general-purpose processors. These challenges include exposing and managing more micro-architectural features, handling softwar... more
Deep learning accelerators address the computational demands of Deep Neural Networks (DNNs), departing from the traditional Von Neumann execution model. They leverage specialized hardware to align with the application domain's structure. Compilers for these accelerators face distinct challenges compared to those for general-purpose processors. These challenges include exposing and managing more micro-architectural features, handling software-managed scratch pads for on-chip storage, explicitly managing data movement, and matching DNN layers with varying hardware capabilities. These complexities necessitate a new approach to compiler design, as traditional compilers mainly focused on generating fine-grained instruction sequences while abstracting micro-architecture details. This paper introduces the Architecture Covenant Graph (ACG), an abstract representation of an architectural structure's components and their programmable capabilities. By enabling the compiler to work with the ACG, it allows for adaptable compilation workflows when making changes to accelerator design, reducing the need for a complete compiler redevelopment. Codelets, which express DNN operation functionality and evolve into execution mappings on the ACG, are key to this process. The Covenant compiler efficiently targets diverse deep learning accelerators, achieving 93.8% performance compared to state-of-the-art, hand-tuned DNN layer implementations when compiling 14 DNN layers from various models on two different architectures. less
BitUP: Efficient Bitmap Data Storage Solution For User Profile

By: Derong Tang, Hank Wang

User profile is widely used in the internet consumer industry, it can be used in recommendation systems for better user experience, or improving Ads system with better conversion rate. Most internet situation we must met large scale data set, thus retrieve efficient and store with less space became a challenge, how to handle trillions rows of data is very common in our business scene, so we proposed a novel solution called BitUP, involved t... more
User profile is widely used in the internet consumer industry, it can be used in recommendation systems for better user experience, or improving Ads system with better conversion rate. Most internet situation we must met large scale data set, thus retrieve efficient and store with less space became a challenge, how to handle trillions rows of data is very common in our business scene, so we proposed a novel solution called BitUP, involved the new distributed bitmap structure to efficient store profile label data, and can querying with better performance, the fundamental structure is bitmap, that is why our method is efficient and less storage overhead. Our design can also scale linearly, to demonstrate the scalability of the proposed solution we show retrieval efficiency in various industry data, which scales across from terabytes to petabytes. less
Sui Lutris: A Blockchain Combining Broadcast and Consensus

By: Same Blackshear, Andrey Chursin, George Danezis, Anastasios Kichidis, Lefteris Kokoris-Kogias, Xun Li, Mark Logan, Ashok Menon, Todd Nowacki, Alberto Sonnino, Brandon Williams, Lu Zhang

Sui Lutris is the first smart-contract platform to sustainably achieve sub-second finality. It achieves this significant decrease in latency by employing consensusless agreement not only for simple payments but for a large variety of transactions. Unlike prior work, Sui Lutris neither compromises expressiveness nor throughput and can run perpetually without restarts. Sui Lutris achieves this by safely integrating consensuless agreement with... more
Sui Lutris is the first smart-contract platform to sustainably achieve sub-second finality. It achieves this significant decrease in latency by employing consensusless agreement not only for simple payments but for a large variety of transactions. Unlike prior work, Sui Lutris neither compromises expressiveness nor throughput and can run perpetually without restarts. Sui Lutris achieves this by safely integrating consensuless agreement with a high-throughput consensus protocol that is invoked out of the critical finality path but makes sure that when a transaction is at risk of inconsistent concurrent accesses its settlement is delayed until the total ordering is resolved. Building such a hybrid architecture is especially delicate during reconfiguration events, where the system needs to preserve the safety of the consensusless path without compromising the long-term liveness of potentially misconfigured clients. We thus develop a novel reconfiguration protocol, the first to show the safe and efficient reconfiguration of a consensusless blockchain. Sui Lutris is currently running in production as part of a major smart-contract platform. Combined with the Move Programming language it enables the safe execution of smart contracts that expose objects as a first-class resource. In our experiments Sui Lutris achieves latency lower than 0.5 seconds for throughput up to 5,000 certificates per second (150k ops/s with bundling), compared to the state-of-the-art real-world consensus latencies of 3 seconds. Furthermore, it gracefully handles validators crash-recovery and does not suffer visible performance degradation during reconfiguration. less
Approaches to Conflict-free Replicated Data Types

By: Paulo Sérgio Almeida

Conflict-free Replicated Data Types (CRDTs) allow optimistic replication in a principled way. Different replicas can proceed independently, being available even under network partitions, and always converging deterministically: replicas that have received the same updates will have equivalent state, even if received in different orders. After a historical tour of the evolution from sequential data types to CRDTs, we present in detail the tw... more
Conflict-free Replicated Data Types (CRDTs) allow optimistic replication in a principled way. Different replicas can proceed independently, being available even under network partitions, and always converging deterministically: replicas that have received the same updates will have equivalent state, even if received in different orders. After a historical tour of the evolution from sequential data types to CRDTs, we present in detail the two main approaches to CRDTs, operation-based and state-based, including two important variations, the pure operation-based and the delta-state based. Intended as a tutorial for prospective CRDT researchers and designers, it provides solid coverage of the essential concepts, clarifying some misconceptions which frequently occur, but also presents some novel insights gained from considerable experience in designing both specific CRDTs and approaches to CRDTs. less
A Survey on Experimental Performance Evaluation of Data Distribution
  Service (DDS) Implementations

By: Kaleem Peeroo, Peter Popov, Vladimir Stankovic

The Data Distribution Service (DDS) is a widely used communication specification for real-time mission-critical systems that follow the principles of publish-subscribe middleware. DDS has an extensive set of quality of service (QoS) parameters allowing a thorough customisation of the intended communication. An extensive survey of the performance of the implementations of this communication middleware is lacking. This paper closes the gap by... more
The Data Distribution Service (DDS) is a widely used communication specification for real-time mission-critical systems that follow the principles of publish-subscribe middleware. DDS has an extensive set of quality of service (QoS) parameters allowing a thorough customisation of the intended communication. An extensive survey of the performance of the implementations of this communication middleware is lacking. This paper closes the gap by surveying the state of the art in performance of various DDS implementations and identifying any research gaps that exist within this domain. less
PartRePer-MPI: Combining Fault Tolerance and Performance for MPI
  Applications

By: Sarthak Joshi, Sathish Vadhiyar

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency resulting in an excessive amount of overhead which would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the ... more
As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency resulting in an excessive amount of overhead which would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the failed processes and using their replicas to continue the regular operation of the application. In this paper, we have implemented PartRePer-MPI, a novel fault-tolerant MPI library that adopts partial replication of some of the launched MPI processes in order to provide resilience from failures. The novelty of our work is that it combines both fault tolerance, due to the use of the User Level Failure Mitigation (ULFM) framework in the Open MPI library, and high performance, due to the use of communication protocols in the native MPI library that is generally fine-tuned for specific HPC platforms. We have implemented efficient and parallel communication strategies with computational and replica processes, and our library can seamlessly provide fault tolerance support to an existing MPI application. Our experiments using seven NAS Parallel Benchmarks and two scientific applications show that the failure-free overheads in PartRePer-MPI when compared to the baseline MVAPICH2, are only up to 6.4% for the NAS parallel benchmarks and up to 9.7% for the scientific applications. less
Efficient Serverless Function Scheduling at the Network Edge

By: Lou Jiong, Tang Zhiqing, Yuan Shijing, Li Jie, Chengtao Wu

Serverless computing is a promising approach for edge computing since its inherent features, e.g., lightweight virtualization, rapid scalability, and economic efficiency. However, previous studies have not studied well the issues of significant cold start latency and highly dynamic workloads in serverless function scheduling, which are exacerbated at the resource-limited network edge. In this paper, we formulate the Serverless Function Sche... more
Serverless computing is a promising approach for edge computing since its inherent features, e.g., lightweight virtualization, rapid scalability, and economic efficiency. However, previous studies have not studied well the issues of significant cold start latency and highly dynamic workloads in serverless function scheduling, which are exacerbated at the resource-limited network edge. In this paper, we formulate the Serverless Function Scheduling (SFS) problem for resource-limited edge computing, aiming to minimize the average response time. To efficiently solve this intractable scheduling problem, we first consider a simplified offline form of the problem and design a polynomial-time optimal scheduling algorithm based on each function's weight. Furthermore, we propose an Enhanced Shortest Function First (ESFF) algorithm, in which the function weight represents the scheduling urgency. To avoid frequent cold starts, ESFF selectively decides the initialization of new function instances when receiving requests. To deal with dynamic workloads, ESFF judiciously replaces serverless functions based on the function weight at the completion time of requests. Extensive simulations based on real-world serverless request traces are conducted, and the results show that ESFF consistently and substantially outperforms existing baselines under different settings. less
AdaMEC: Towards a Context-Adaptive and Dynamically-Combinable DNN
  Deployment Framework for Mobile Edge Computing

By: Bowen Pang, Sicong Liu, Hongli Wang, Bin Guo, Yuzhan Wang, Hao Wang, Zhenli Sheng, Zhongyi Wang, Zhiwen Yu

With the rapid development of deep learning, recent research on intelligent and interactive mobile applications (e.g., health monitoring, speech recognition) has attracted extensive attention. And these applications necessitate the mobile edge computing scheme, i.e., offloading partial computation from mobile devices to edge devices for inference acceleration and transmission load reduction. The current practices have relied on collaborativ... more
With the rapid development of deep learning, recent research on intelligent and interactive mobile applications (e.g., health monitoring, speech recognition) has attracted extensive attention. And these applications necessitate the mobile edge computing scheme, i.e., offloading partial computation from mobile devices to edge devices for inference acceleration and transmission load reduction. The current practices have relied on collaborative DNN partition and offloading to satisfy the predefined latency requirements, which is intractable to adapt to the dynamic deployment context at runtime. AdaMEC, a context-adaptive and dynamically-combinable DNN deployment framework is proposed to meet these requirements for mobile edge computing, which consists of three novel techniques. First, once-for-all DNN pre-partition divides DNN at the primitive operator level and stores partitioned modules into executable files, defined as pre-partitioned DNN atoms. Second, context-adaptive DNN atom combination and offloading introduces a graph-based decision algorithm to quickly search the suitable combination of atoms and adaptively make the offloading plan under dynamic deployment contexts. Third, runtime latency predictor provides timely latency feedback for DNN deployment considering both DNN configurations and dynamic contexts. Extensive experiments demonstrate that AdaMEC outperforms state-of-the-art baselines in terms of latency reduction by up to 62.14% and average memory saving by 55.21%. less
Mysticeti: Low-Latency DAG Consensus with Fast Commit Path

By: Kushal Babel, Andrey Chursin, George Danezis, Lefteris Kokoris-Kogias, Alberto Sonnino

We introduce Mysticeti-C a byzantine consensus protocol with low-latency and high resource efficiency. It leverages a DAG based on Threshold Clocks and incorporates innovations in pipelining and multiple leaders to reduce latency in the steady state and under crash failures. Mysticeti-FPC incorporates a fast commit path that has even lower latency. We prove the safety and liveness of the protocols in a byzantine context. We evaluate Mystice... more
We introduce Mysticeti-C a byzantine consensus protocol with low-latency and high resource efficiency. It leverages a DAG based on Threshold Clocks and incorporates innovations in pipelining and multiple leaders to reduce latency in the steady state and under crash failures. Mysticeti-FPC incorporates a fast commit path that has even lower latency. We prove the safety and liveness of the protocols in a byzantine context. We evaluate Mysticeti and compare it with state-of-the-art consensus and fast path protocols to demonstrate its low latency and resource efficiency, as well as more graceful degradation under crash failures. Mysticeti is the first byzantine protocol to achieve WAN latency of 0.5s for consensus commit, at a throughput of over 50k TPS that matches the state-of-the-art. less
Overview of Caching Mechanisms to Improve Hadoop Performance

By: Rana Ghazali, Douglas G. Down

Nowadays distributed computing environments, large amounts of data are generated from different resources with a high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in a parallel manner across a cluster of machines in a distributed environment. Hadoop brings many benefits like flexibility, scalability, and high fault tolerance;... more
Nowadays distributed computing environments, large amounts of data are generated from different resources with a high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in a parallel manner across a cluster of machines in a distributed environment. Hadoop brings many benefits like flexibility, scalability, and high fault tolerance; however, it faces some challenges in terms of data access time, I/O operation, and duplicate computations resulting in extra overhead, resource wastage, and poor performance. Many researchers have utilized caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease the job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time. less