## News

### Ph.D. Defence

Tobias Vincon has successfully defended his dissertation "Data-Intensive Systems on Modern Hardware: Leveraging Near-Data Processing to Counter the Growth of Data".

### Summer School

Arthur Bernhardt attends the Summer School on Scalable Data Management for Future Hardware.

### BW-CAR Membership

Ilia Petrov has been elected a member of the Baden-Württemberg Center of Applied Research (BW-CAR) and is a founding member of the doctoral consortium (Promotionsverband) Baden-Württemberg.

### Paper Accepted at VLDB

T. Vincon, C. Knoedler, L. Solis-Vasquez, A. Bernhardt, S. Tamimi, L. Weber, F. Stock, A. Koch, I. Petrov: Near-Data Processing in Database Systems on Native Computational Storage under HTAP Workloads. In Proc. VLDB 2022.

In this paper we show that Near-Data Processing (NDP) naturally fits in the HTAP design space. We propose an architecture for update-aware NDP, allowing transactionally consistent in-situ executions of analytical operations in presence of concurrent updates in HTAP settings.

### Paper Accepted at VLDBJ

T. Bang, N. May, I. Petrov, C. Binnig. The full story of 1000 cores. In VLDB Journal (2022).

In this paper, we further extend our analysis from DaMoN 2020, detailing the effect of hardware and workload characteristics via additional real hardware platforms (IBM Power8 and 9) and the full TPC-C transaction mix.

### Paper Accepted at EDBT

B. Moessner, C. Riegger, A. Bernhardt, I. Petrov. bloomRF: On Performing Range-Queries in Bloom-Filters with Piecewise-Monotone Hash Functions and Prefix Hashing. In Proc. EDBT 2023.

We introduce bloomRF as a unified point-range filter that extends Bloom-filters with range-queries.

### Paper Accepted at ADBIS

C. Riegger, I. Petrov. Storage Management with Multi-Version Partitioned BTrees. In Proc. ADBIS 2022.

We propose Multi-Version Partitioned BTrees (MV-PBT) as sole storage and index management structure in key-sorted storage engines like K/V-Stores.

Abstract:

Database Management Systems and K/V-Stores operate on updatable datasets massively exceeding the size of available main memory. Tree-based K/V storage management structures have become particularly popular in storage engines. B+-Trees [1,4] allow constant search performance, but write-heavy workloads yield inefficient write patterns to secondary storage devices and poor performance characteristics. LSM-Trees overcome this issue by horizontally partitioning fractions of data small enough to fully reside in main memory, but require frequent maintenance to sustain search performance. Firstly, we propose Multi-Version Partitioned BTrees (MV-PBT) as the sole storage and index management structure in key-sorted storage engines like K/V-Stores. Secondly, we compare MV-PBT against LSM-Trees. The logical horizontal partitioning in MV-PBT allows leveraging recent advances in modern B+-Tree techniques in a small, transparent and memory-resident portion of the structure. Structural properties sustain steady read performance, yielding efficient write patterns and reducing write amplification. We integrated MV-PBT in the WiredTiger KV storage engine. MV-PBT offers an up to 2x increased steady throughput in comparison to LSM-Trees and several orders of magnitude in comparison to B+-Trees in a YCSB workload.

### Paper Accepted at DAMON

T. Vincon, C. Knoedler, A. Bernhardt, L. Solis-Vasquez, L. Weber, A. Koch, I. Petrov. Result-Set Management for NDP Operations on Smart Storage. In Proc. DAMON 2022.

In this work, we introduce a set of in-situ NDP result-set management techniques, such as spilling, materialization, and reuse.

### Paper Accepted at FCCM 2022

S. Tamimi, F. Stock, A. Bernhardt, I. Petrov, A. Koch. An Evaluation of Using CCIX for Cache-Coherent Host-FPGA Interfacing. In Proc. FCCM 2022.

In this work, we compare-and-contrast the use of CCIX with PCIe when interfacing an ARM-based host with two generations of CCIX-enabled FPGAs. We provide both low-level throughput and latency measurements for accesses and address translation, as well as examine an application-level use-case of using CCIX for fine-grained synchronization in an FPGA-accelerated database system.

Abstract:

For a long time, most discrete accelerators have been attached to host systems using various generations of the PCI Express interface. However, with its lack of support for coherency between accelerator and host caches, fine-grained interactions require frequent cache-flushes, or even the use of inefficient uncached memory regions. The Cache Coherent Interconnect for Accelerators (CCIX) was the first multi-vendor standard for enabling cache-coherent host-accelerator attachments, and already is indicative of the capabilities of upcoming standards such as Compute Express Link (CXL). In our work, we compare-and-contrast the use of CCIX with PCIe when interfacing an ARM-based host with two generations of CCIX-enabled FPGAs. We provide both low-level throughput and latency measurements for accesses and address translation, as well as examine an application-level use-case of using CCIX for fine-grained synchronization in an FPGA-accelerated database system. We can show that especially smaller reads from the FPGA to the host can benefit from CCIX by having roughly 33% shorter latency than PCIe. Small writes to the host have a latency roughly 32% higher than PCIe, though, since they carry a higher coherency overhead. For the database use-case, the use of CCIX made it possible to maintain a constant synchronization latency even with heavy host-FPGA parallelism.

### Paper Accepted at ICDE

A. Bernhardt, S. Tamimi, T. Vincon, C. Knoedler, F. Stock, C. Heinz, A. Koch, I. Petrov: neoDBMS: In-situ Snapshots for Multi-Version DBMS on Native Computational Storage. In Proc. ICDE 2022.

In this paper, we showcase how neoDBMS performs snapshot computation in-situ.

Abstract:

Multi-versioning and MVCC are the foundations of many modern DBMSs. Under mixed workloads and large datasets, the creation of the transactional snapshot can become very expensive, as long-running analytical transactions may request old versions, residing on cold storage, for reasons of transactional consistency. Furthermore, analytical queries operate on cold data, stored on slow persistent storage. Due to the poor data locality, snapshot creation may cause massive data transfers and thus lower performance. Given the current trend towards computational storage and near-data processing, it has become viable to perform such operations in-storage to reduce data transfers and improve scalability. neoDBMS is a DBMS designed for near-data processing and computational storage. In this paper, we demonstrate how neoDBMS performs snapshot computation in-situ. We showcase different interactive scenarios, where neoDBMS outperforms PostgreSQL 12 by up to 5x.

### Paper Accepted at EDBT

A. Bernhardt, S. Tamimi, F. Stock, A. Koch, T. Vincon, I. Petrov: Cache-Coherent Shared Locking for Transactionally Consistent Updates in Near-Data Processing DBMS on Smart Storage. In Proc. EDBT 2022.

We introduce a low-latency cache-coherent shared lock table for update NDP settings. It utilizes the novel CCIX interconnect technology and is integrated in neoDBMS, a near-data processing DBMS for smart storage

Abstract:

Even though near-data processing (NDP) can provably reduce data transfers and increase performance, current NDP is solely utilized in read-only settings. Slow or tedious-to-implement synchronization and invalidation mechanisms between host and smart storage make NDP support for data-intensive update operations difficult. In this paper, we introduce a low-latency cache-coherent shared lock table for update NDP settings. It utilizes the novel CCIX interconnect technology and is integrated in neoDBMS, a near-data processing DBMS for smart storage. Our evaluation indicates end-to-end lock latencies of ~80-100 ns and robust performance under contention.

### HardBD/Active 2022

DBlab is co-organising this year's HardBD/Active Workshop at ICDE 2022. HardBD/Active 2022 has a very strong programme. Follow it online and be part of it!

Both HardBD and Active are interested in exploiting hardware technologies for data-intensive systems. The workshop aims at providing a forum for academia and industry to exchange ideas through research and position papers.

### New Paper Accepted

Christian Knoedler, Tobias Vincon, Arthur Bernhardt, Leonardo Solis-Vasquez, Lukas Weber, Ilia Petrov, Andreas Koch. A cost model for NDP-aware query optimization for KV-stores. In Proc. DAMON 2021.

We show the need for cost-model-based optimisation, covering both execution on the traditional stack and the upcoming in-situ execution on computational storage.

Abstract:

Many modern DBMS architectures require transferring data from storage to process it afterwards. Given the continuously increasing amounts of data, data transfers quickly become a scalability limiting factor. Near-Data Processing and smart/computational storage emerge as promising trends allowing for decoupled in-situ operation execution, data transfer reduction and better bandwidth utilization. However, not every operation is suitable for an in-situ execution and a careful placement and optimization is needed. In this paper we present an NDP-aware cost model. It has been implemented in MySQL and evaluated with nKV. We make several observations underscoring the need for optimization.
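
The placement trade-off such a cost model captures can be sketched in a few lines (a toy model with invented parameters and cost formulas, not the model from the paper): an operator is a pushdown candidate when the saved transfer cost outweighs the slower in-storage compute.

```python
# Toy NDP placement decision (invented parameters, not the paper's model):
# push an operator down to computational storage when the estimated
# in-situ cost undercuts the host cost including the data transfer.

def host_cost(bytes_scanned, host_bw, host_cpu_cost):
    # transfer everything to the host, then process it there
    return bytes_scanned / host_bw + host_cpu_cost

def ndp_cost(bytes_scanned, selectivity, device_bw, ndp_cpu_cost):
    # process in-situ, ship only the (selective) result set
    return ndp_cpu_cost + (bytes_scanned * selectivity) / device_bw

def place_operator(bytes_scanned, selectivity,
                   host_bw=3e9, device_bw=3e9,
                   host_cpu_cost=0.5, ndp_cpu_cost=1.0):
    h = host_cost(bytes_scanned, host_bw, host_cpu_cost)
    n = ndp_cost(bytes_scanned, selectivity, device_bw, ndp_cpu_cost)
    return "ndp" if n < h else "host"

# A highly selective scan over a large table favors pushdown...
assert place_operator(bytes_scanned=100e9, selectivity=0.01) == "ndp"
# ...while a small, unselective scan stays on the host.
assert place_operator(bytes_scanned=1e9, selectivity=0.9) == "host"
```

The point of the sketch is only the structure of the decision; a real optimizer would estimate selectivities and calibrate the cost constants per device.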

### Paper Accepted at RAW@IPDPS

Lukas Weber, Lukas Sommer, Leonardo Solis-Vasquez, Tobias Vincon, Christian Knoedler, Arthur Bernhardt, Ilia Petrov, Andreas Koch. A Framework for the Automatic Generation of FPGA-based Near-Data Processing Accelerators in Smart Storage Systems. In Proc. Reconfigurable Architectures Workshop. RAW@IPDPS.

We introduce a framework for the automatic generation of data format parsers and accessors for NDP DBMS on computational storage.

Abstract:

Near-Data Processing is a promising approach to overcome the limitations of slow I/O interfaces in the quest to analyze the ever-growing amount of data stored in database systems. Next to CPUs, FPGAs will play an important role for the realization of functional units operating close to data stored in non-volatile memories such as Flash. It is essential that the NDP-device understands formats and layouts of the persistent data, to perform operations in-situ. To this end, carefully optimized format parsers and layout accessors are needed. However, designing such FPGA-based Near-Data Processing accelerators requires significant effort and expertise. To make FPGA-based Near-Data Processing accessible to non-FPGA experts, we will present a framework for the automatic generation of FPGA-based accelerators capable of data filtering and transformation for key-value stores based on simple data-format specifications. The evaluation shows that our framework is able to generate accelerators that are almost identical in performance compared to the manually optimized designs of prior work, while requiring little to no FPGA-specific knowledge and additionally providing improved flexibility and more powerful functionality.

### Paper Accepted at DAPD

L. Weber, T. Vincon, C. Knoedler, L. Solis-Vasquez, A. Bernhardt, I. Petrov, A. Koch. On the Necessity of Explicit Cross-Layer Data Formats in Near-Data Processing Systems. In Journal of Distributed and Parallel Databases. DAPD.

The NDP-style processing requires an explicit definition of cross-layer data formats and accessors to ensure in-situ executions optimally utilizing the properties of the underlying NDP storage and compute elements.

Abstract:

Massive data transfers in modern data-intensive systems resulting from low data-locality and data-to-code system design hurt their performance and scalability. Near-Data Processing (NDP) and a shift to code-to-data designs may represent a viable solution, as packaging combinations of storage and compute elements on the same device has become feasible. The shift towards NDP system architectures calls for a revision of established principles. Abstractions such as data formats and layouts typically spread across multiple layers in a traditional DBMS; the way they are processed is encapsulated within these layers of abstraction. NDP-style processing requires an explicit definition of cross-layer data formats and accessors to ensure in-situ executions optimally utilizing the properties of the underlying NDP storage and compute elements. In this paper, we make the case for such data format definitions and investigate the performance benefits under RocksDB and the COSMOS hardware platform.

### New Preprint Available

Christian Riegger, Arthur Bernhardt, Bernhard Moessner, Ilia Petrov.
bloomRF: On Performing Range-Queries with Bloom-Filters based on Piecewise-Monotone Hash Functions and Dyadic Trace-Trees. [arXiv].

bloomRF is a unified method for approximate membership testing that can efficiently perform both point- and range-queries on a single data structure.

Abstract:

We introduce bloomRF as a unified method for approximate membership testing that supports both point- and range-queries on a single data structure. bloomRF extends Bloom-Filters with range-query support and may replace them. The core idea is to employ a dyadic interval scheme to determine the set of dyadic intervals covering a data point, which are then encoded and inserted. bloomRF introduces Dyadic Trace-Trees as a novel data structure that represents those covering intervals implicitly. A Trace-Tree encoding scheme represents the set of covering intervals efficiently, in a compact bit representation. Furthermore, bloomRF introduces novel piecewise-monotone hash functions that are locally order-preserving and thus support range querying. We present an efficient membership computation method for range-queries. Although bloomRF is designed for integers, it also supports string and floating-point data types. It can also handle multiple attributes and serve as a multi-attribute filter. We evaluate bloomRF in RocksDB and in a standalone library. bloomRF is more efficient and outperforms existing point-range-filters by up to 4x across a range of settings.
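
The dyadic-interval idea at the core of bloomRF can be illustrated with a small sketch (our own toy code, not the authors' implementation): every integer in a bounded domain is covered by exactly one dyadic interval per level, identified by a bit-prefix of the value, which is what prefix hashing exploits.

```python
# Illustrative sketch of the dyadic-interval scheme behind bloomRF
# (toy code for intuition, not the authors' implementation).
# Every integer x in [0, 2**levels) is covered by exactly one dyadic
# interval per level; that interval is identified by the bit-prefix
# of x at that level.

def dyadic_cover(x, levels):
    """Return the (level, prefix) pairs of all dyadic intervals covering x."""
    return [(lvl, x >> (levels - lvl)) for lvl in range(levels + 1)]

# A point insert/probe touches all covering intervals; a range query is
# decomposed into O(levels) disjoint dyadic intervals probed against the
# same prefix encoding.
cover = dyadic_cover(13, 4)          # 13 = 0b1101, domain [0, 16)
assert cover[0] == (0, 0)            # level 0 covers the whole domain
assert cover[-1] == (4, 13)          # the deepest level is the point itself
```

In the actual filter these prefixes are not stored explicitly but hashed into a bit array with the locally order-preserving hash functions described in the abstract.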

### New DBlab Member

The DBlab team is happy to welcome Christian Knoedler on board!

Christian will strengthen our neoDBMS-Team.

### New Paper Accepted

T. Vincon, L. Weber, A. Bernhardt, C. Riegger, S. Hardock, C. Knoedler, F. Stock, L. Solis-Vasquez, S. Tamimi, A. Koch, I. Petrov. nKV in Action: Accelerating KV-Stores on Native Computational Storage with Near-Data Processing. In Proc. VLDB 2020.

In this paper we introduce nKV, which is a key/value store utilizing native computational storage and near-data processing.

Abstract:

Massive data transfers in modern data-intensive systems resulting from low data-locality and data-to-code system design hurt their performance and scalability. Near-data processing (NDP) designs represent a feasible solution, which although not new, has yet to see widespread use. In this paper we demonstrate various NDP alternatives in nKV, which is a key/value store utilizing native computational storage and near-data processing. We showcase the execution of classical operations (GET, SCAN) and complex graph-processing algorithms (Betweenness Centrality) in-situ, with 1.4x-2.7x better performance due to NDP. nKV runs on real hardware, the COSMOS+ platform.

### New Paper Accepted

T. Bang, I. Oukid, N. May, I. Petrov, C. Binnig. Robust Performance of Main Memory Data Structures by Configuration. In Proc. SIGMOD 2020.

In this paper, we present a new approach for achieving robust performance of data structures, making it easier to reuse the same design both for different hardware generations and for different workloads.

Abstract:

In this paper, we present a new approach for achieving robust performance of data structures, making it easier to reuse the same design both for different hardware generations and for different workloads. To achieve robust performance, the main idea is to strictly separate the data structure design from the actual strategies to execute access operations, and to adjust the execution strategies by means of so-called configurations instead of hard-wiring the execution strategy into the data structure. In our evaluation we demonstrate the benefits of this configuration approach for individual data structures as well as for complex OLTP workloads.
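
The separation of data-structure design from execution strategy can be sketched as follows (a simplified illustration with hypothetical strategy names, not the paper's implementation):

```python
# Toy sketch of the configuration idea (our reading, simplified): the
# data structure stays fixed, while the strategy used to execute an
# access operation is selected through a configuration instead of
# being hard-wired into the structure.

def sync_lock(op):
    # pessimistic execution strategy: latch around the operation
    return f"lock; {op}; unlock"

def sync_optimistic(op):
    # optimistic execution strategy: version validation instead of latching
    return f"read-version; {op}; validate-version"

STRATEGIES = {"lock": sync_lock, "optimistic": sync_optimistic}

class ConfiguredIndex:
    """A toy index whose access path is chosen by configuration."""
    def __init__(self, config):
        self.execute = STRATEGIES[config["sync"]]
        self.keys = set()

    def insert(self, key):
        plan = self.execute(f"insert({key})")
        self.keys.add(key)
        return plan

# Re-tuning for new hardware or a new workload means swapping the
# configuration, not rewriting the data structure.
idx = ConfiguredIndex({"sync": "optimistic"})
assert "validate-version" in idx.insert(42)
```

The design choice mirrored here is the strategy pattern: the structure's layout code never changes, only the configured access strategy does.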

### New Paper Accepted

T. Bang, N. May, I. Petrov, C. Binnig. The Tale of 1000 Cores: An Evaluation of Concurrency Control on Real(ly) Large Multi-Socket Hardware. In Proc. DAMON 2020.

We follow up on this prior work with an evaluation of the characteristics of concurrency control schemes on real production multi-socket hardware with 1568 cores.

Abstract:

In this paper, we set out to revisit the results of "Staring into the Abyss [...] of Concurrency Control with [1000] Cores" and analyse in-memory DBMSs on today's large hardware. Despite the original assumption of the authors, today we do not see single-socket CPUs with 1000 cores. Instead, multi-socket hardware made its way into production data centres. Hence, we follow up on this prior work with an evaluation of the characteristics of concurrency control schemes on real production multi-socket hardware with 1568 cores. To our surprise, we made several interesting findings which we report on in this paper.

### New Paper Accepted

T. Vincon, L. Weber, A. Bernhardt, A. Koch, I. Petrov. nKV: Near-Data Processing with KV-Stores on Native Computational Storage. In Proc. DAMON 2020.

In this paper we introduce nKV, which is a key/value store utilizing native computational storage and near-data processing.

Abstract:

Massive data transfers in modern key/value stores resulting from low data-locality and data-to-code system design hurt their performance and scalability. Near-data processing (NDP) designs represent a feasible solution, which although not new, have yet to see widespread use. In this paper we introduce nKV, which is a key/value store utilizing native computational storage and near-data processing. On the one hand, nKV can directly control the data and computation placement on the underlying storage hardware. On the other hand, nKV propagates the data formats and layouts to the storage device, where software and hardware parsers and accessors are implemented. Both allow NDP operations to execute in a host-intervention-free manner, directly on physical addresses, and thus better utilize the underlying hardware. Our performance evaluation is based on executing traditional KV operations (GET, SCAN) and complex graph-processing algorithms (Betweenness Centrality) in-situ, with 1.4x-2.7x better performance on real hardware, the COSMOS+ platform.

### New Paper Accepted

T. Vincon, A. Bernhardt, L. Weber, A. Koch, I. Petrov. On the Necessity of Explicit Cross-Layer Data Formats in Near-Data Processing Systems. In Proc. HardBD 2020.

The NDP-style processing requires an explicit definition of cross-layer data formats and accessors to ensure in-situ executions optimally utilizing the properties of the underlying NDP storage and compute elements.

Abstract:

Massive data transfers in modern data-intensive systems resulting from low data-locality and data-to-code system design hurt their performance and scalability. Near-data processing (NDP) and a shift to code-to-data designs may represent a viable solution, as packaging combinations of storage and compute elements on the same device has become feasible. The shift towards NDP system architectures calls for a revision of established principles. Abstractions such as data formats and layouts typically spread across multiple layers in a traditional DBMS; the way they are processed is encapsulated within these layers of abstraction. NDP-style processing requires an explicit definition of cross-layer data formats and accessors to ensure in-situ executions optimally utilizing the properties of the underlying NDP storage and compute elements. In this paper, we make the case for such data format definitions and investigate the performance benefits under RocksDB and the COSMOS hardware platform.

### New Project Grant

pimDB: infrastructure for Processing-In-Memory in modern DBMS

Principal Investigators:
Data Management Lab

pimDB provides infrastructures for PIM research in modern main-memory DBMS.

### New Paper Accepted

C. Riegger, T. Vincon, R. Gottstein, I. Petrov. MV-PBT: Multi-Version Indexing for Large Datasets and HTAP Workloads. In Proc. EDBT 2020.

MV-PBT is a version-aware index structure for HTAP workloads, supporting index-only visibility-checks and flash-friendly I/O patterns.

Abstract:

Modern mixed (HTAP) workloads execute fast update-transactions and long-running analytical queries on the same dataset and system. In multi-version (MVCC) systems, such workloads result in many short-lived versions and long version-chains as well as in increased and frequent maintenance overhead. Consequently, the index pressure increases significantly. Firstly, the frequent modifications cause frequent creation of new versions, yielding a surge in index maintenance overhead. Secondly, and more importantly, index-scans incur extra I/O overhead to determine which of the resulting tuple-versions are visible to the executing transaction (visibility-check), as current designs only store version/timestamp information in the base table, not in the index. Such an index-only visibility-check is critical for HTAP workloads on large datasets. In this paper we propose the Multi-Version Partitioned B-Tree (MV-PBT) as a version-aware index structure, supporting index-only visibility-checks and flash-friendly I/O patterns. The experimental evaluation indicates a 2x improvement for analytical queries and 15% higher transactional throughput under HTAP workloads. MV-PBT offers 40% higher tx. throughput compared to WiredTiger's LSM-Tree implementation under YCSB.
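
The partitioning idea behind partitioned B-trees can be illustrated with a toy sketch (our own simplification, not the actual MV-PBT design): partitions live inside one key-sorted structure by prefixing each key with a partition number; only the newest partition accepts writes, while full partitions become immutable and can be evicted sequentially.

```python
# Toy sketch of the partitioned-B-tree idea (illustration only, not MV-PBT):
# one key-sorted structure holds multiple partitions by prefixing each key
# with a partition number; only the newest partition accepts inserts.
import bisect

class PartitionedTree:
    def __init__(self, partition_capacity=2):
        self.entries = []            # sorted list of (partition, key)
        self.current = 0             # id of the mutable partition
        self.capacity = partition_capacity
        self.count = 0               # entries in the current partition

    def insert(self, key):
        if self.count == self.capacity:   # partition full -> freeze, open new
            self.current += 1
            self.count = 0
        bisect.insort(self.entries, (self.current, key))
        self.count += 1

    def lookup(self, key):
        # probe newest partition first, so the latest version wins
        for p in range(self.current, -1, -1):
            i = bisect.bisect_left(self.entries, (p, key))
            if i < len(self.entries) and self.entries[i] == (p, key):
                return p
        return None

t = PartitionedTree()
for k in ["a", "b", "c"]:
    t.insert(k)
assert t.lookup("c") == 1   # "c" landed in the second partition
assert t.lookup("a") == 0
```

Because a frozen partition is a contiguous, sorted key range, evicting it produces the sequential write pattern the abstract describes as flash-friendly.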

### New DBlab Member

The DBlab team is happy to welcome Arthur Bernhardt on board!

Arthur will strengthen our neoDBMS-Team.

### New DFG Project Grant

neoDBMS: Hardware/Software Co-Design for Accelerated Near-Data Processing in Modern Database Systems

Principal Investigators:
Embedded Systems and Applications Group, Technische Universitaet Darmstadt
Data Management Lab, Reutlingen University
Funding agency: DFG

neoDBMS aims to explore new architectures, abstractions and algorithms for intelligent database storage capable of performing Near-Data Processing (NDP) and executing data- or compute-intensive DBMS operations in-situ.

Abstract:

With advances in semiconductor technologies, it has nowadays become economical to produce combinations of modern semiconductor storage (e.g., Non-volatile Memories) and powerful compute-units (FPGA, GPU, many-core CPUs) co-located on, or close to, the same device, yielding intelligent storage devices. Data movements have become a limiting factor in times of exponential data growth, since they are blocking, frequent, and impair scalability. However, existing solution approaches are mainly based on 40-year-old architectures, following the paradigm of *transporting* data to the processing elements. This procedure has both time and energy penalties. The "memory wall" and the "von Neumann bottleneck" amplify the negative performance impact of those deficiencies. The present project aims to explore new architectures, abstractions and algorithms for intelligent database storage capable of performing Near-Data Processing (NDP). We target intelligent storage devices, comprising Non-volatile Memories or next-generation 3D-DRAM (such as the HMC), as well as the use of FPGAs as computational units. We intend to investigate the following research questions: 1) Support for NDP in update-environments and hybrid workloads. 2) Support for NDP in DBMS on Non-volatile Memories and NDP-support for declarative data layouts. 3) NDP use of shared virtual memory.

### PIM Survey Published

T. Vincon, A. Koch, I. Petrov. Moving Processing to Data: On the Influence of Processing-in-Memory on Data Management. arXiv.

Near-Data Processing ideally allows executing application-defined data- or compute-intensive operations in-situ, i.e. within (or close to) the physical data storage.

Abstract:

Near-Data Processing refers to an architectural hardware and software paradigm based on the co-location of storage and compute units. Ideally, it allows executing application-defined data- or compute-intensive operations in-situ, i.e. within (or close to) the physical data storage. Thus, Near-Data Processing seeks to minimize expensive data movement, improving performance, scalability, and resource-efficiency. Processing-in-Memory is a sub-class of Near-Data Processing that targets data processing directly within memory (DRAM) chips. The effective use of Near-Data Processing mandates new architectures, algorithms, interfaces, and development toolchains.

### nativeNDP: Processing Big Data Analytics on Native Storage Nodes

T. Vincon, S. Hardock, C. Riegger, A. Koch, I. Petrov. nativeNDP: Processing Big Data Analytics on Native Storage Nodes. In Proc. ADBIS 2019.

We propose nativeNDP, a framework for Near-Data Processing that pushes down primitive R tasks and executes them in-situ, directly within the storage device of a cluster-node.

Abstract:

Data analytics tasks on large datasets are computationally-intensive and often demand the compute power of cluster environments. Yet, data cleansing, preparation, dataset characterization and statistics or metrics computation steps are frequent. These are mostly performed ad hoc, in an explorative manner, and mandate low response times. But such steps are I/O intensive and typically very slow due to low data locality, inadequate interfaces and abstractions along the stack. These typically result in prohibitively expensive scans of the full dataset and transformations on interface boundaries. In this paper we examine R as an analytical tool, managing large persistent datasets in Ceph, a widespread cluster file-system. We propose nativeNDP, a framework for Near-Data Processing that pushes down primitive R tasks and executes them in-situ, directly within the storage device of a cluster-node. Across a range of data sizes, we show that nativeNDP is more than an order of magnitude faster than other pushdown alternatives.

### Indexing large updatable Datasets in Multi-Version Database Management Systems

C. Riegger, T. Vincon, I. Petrov. Indexing large updatable Datasets in Multi-Version Database Management Systems. In Proc. IDEAS 2019.

In this paper we present the implementation of Partitioned B-Trees in PostgreSQL extended with SIAS.

Abstract:

Database Management Systems (DBMS) need to handle large updatable datasets under OLTP workloads. Most modern DBMS provide snapshots of data in an MVCC transaction management scheme. Each transaction operates on a snapshot of the database. It is calculated from a set of tuple versions, containing logical transaction timestamps. This transaction management scheme enables high parallelism and resource-efficient append-only data placement on secondary storage. One major issue in indexing tuple versions on modern hardware technologies is the high write amplification for tree-indexes. Partitioned B-Trees (PBTs) are based on the structure and algorithms of the ubiquitous B+-Tree. They achieve near-optimal write amplification and beneficial sequential writes on secondary storage. In this paper we present the implementation of PBTs in PostgreSQL extended with SIAS. Compared to PostgreSQL's standard B+-Trees, PBTs have 50% better transactional throughput under TPC-C.

### IPA-IDX: In-Place Appends for B-Tree Indices

S. Hardock, A. Koch, T. Vincon, I. Petrov. IPA-IDX: In-Place Appends for B-Tree Indices. In Proc. DaMoN 2019.

IPA-IDX is an approach to handle index modifications on modern storage technologies (NVM, Flash) as physical in-place appends, using simplified physiological log records.

Abstract:

We introduce IPA-IDX, an approach to handle index modifications on modern storage technologies (NVM, Flash) as physical in-place appends, using simplified physiological log records. IPA-IDX provides similar performance and longevity advantages for indexes as basic IPA does for tables. The selective application of IPA-IDX and basic IPA to certain regions and objects lowers the GC overhead by over 60%, while keeping the total space overhead to 2%. The combined effect of IPA and IPA-IDX increases performance by 28%.

### Native Storage Techniques for Data Management

I. Petrov, A. Koch, S. Hardock, T. Vincon, C. Riegger
In Proc. ICDE 2019

Native storage approaches, architectures and techniques for data processing and data management.

21.11.2019 Paper Accepted at ICDE 2019

I. Petrov, A. Koch, S. Hardock, T. Vincon, C. Riegger. Native Storage Techniques for Data Management. In Proc. ICDE 2019.

Abstract:

In the present tutorial we perform a cross-cut analysis of database storage management from the perspective of modern storage technologies. We argue that neither the design of modern DBMS, nor the architecture of modern storage technologies are aligned with each other. Moreover, the majority of the systems rely on a complex multi-layer and compatibility-oriented storage stack. The result is needlessly suboptimal DBMS performance, inefficient utilization, or significant write amplification due to outdated abstractions and interfaces. In the present tutorial we focus on the concept of native storage, which is storage operated without intermediate abstraction layers over an open native storage interface and is directly controlled by the DBMS. We cover the following aspects of native storage: (i) architectural approaches and techniques; (ii) interfaces; (iii) storage abstractions; (iv) DBMS/system integration; (v) in-storage processing.

### DBLab has open-sourced NoFTL, SIAS and cIPT

Check out DBLab's GitHub repository.
We have open-sourced NoFTL, SIAS, and cIPT.

### New Project Grant

PANDAS: Programmable Appliance for Near Data Processing Accelerated Storage

Funding agency: BMBF

Principal Investigators:
PRO DESIGN Electronic GmbH
Xelera Technologies GmbH
Embedded Systems and Applications Group, Technische Universitaet Darmstadt
Data Management Lab, Reutlingen University

### Efficient Data and Indexing Structure for Blockchains in Enterprise Systems

C. Riegger, T. Vincon, I. Petrov.
In Proc. iiWAS 2018

17.09.2018 Paper Accepted at iiWAS 2018

C. Riegger, T. Vincon, I. Petrov. Efficient Data and Indexing Structure for Blockchains in Enterprise Systems. In Proc. iiWAS 2018.

Abstract:

Blockchains yield new workloads in database management systems and K/V-Stores. Distributed Ledger Technology (DLT) is a technique for managing transactions in "trustless" distributed systems. Yet, clients of nodes in blockchain networks are backed by "trustworthy" K/V-Stores, like LevelDB or RocksDB in Ethereum, which are based on Log-Structured Merge Trees (LSM-Trees). However, LSM-Trees do not fully match the properties of blockchains and enterprise workloads. In this paper, we claim that Partitioned B-Trees (PBT) fit the properties of this DLT: uniformly distributed hash keys, immutability, consensus, invalid blocks, unspent and off-chain transactions, reorganization and data state / version ordering in a distributed log-structure. PBT can locate records of newly inserted key-value pairs, as well as data of unspent transactions, in separate partitions in main memory. Once several blocks acquire consensus, PBTs evict a whole partition, which becomes immutable, to secondary storage. This behavior minimizes write amplification and enables a beneficial sequential write pattern on modern hardware. Furthermore, DLT implicates some type of log-based versioning. PBTs can serve as MV-Store for data storage of logical blocks and indexing in multi-version concurrency control (MVCC) transaction processing.

### Two entries in Encyclopedia of Big Data Technologies, Sakr, Sherif, Zomaya, Albert (Eds.), Springer

I. Petrov, T. Vincon, A. Koch, J. Oppermann, S. Hardock, C. Riegger. Active Storage
In Enc. Big Data Technologies Sakr, Zomaya (Eds.) Springer 2018.

I. Petrov, A. Koch, T. Vincon, S. Hardock, C. Riegger. Transaction Processing on NVM
In Enc. Big Data Technologies Sakr, Zomaya (Eds.) Springer 2018.

### NoFTL-KV: Tackling Write-Amplification on KV-Stores with Native Storage Management

T. Vincon, S. Hardock, C. Riegger, J. Oppermann, A. Koch, I. Petrov.
In Proc. EDBT 2018

22.12.2017 Paper Accepted at EDBT 2018

T. Vincon, S. Hardock, C. Riegger, J. Oppermann, A. Koch, I. Petrov. NoFTL-KV: Tackling Write-Amplification on KV-Stores with Native Storage Management. In Proc. EDBT 2018.

[PDF]

Abstract:

Modern persistent Key/Value stores are designed to meet the demand for high transactional throughput and high data-ingestion rates. Still, they rely on a backwards-compatible storage stack and abstractions to ease space management, foster seamless proliferation and system integration. Their dependence on the traditional I/O stack has a negative impact on performance, causes unacceptably high write-amplification, and limits storage longevity.
In this paper we present NoFTL-KV, an approach that results in a lean I/O stack, integrating physical storage management natively in the Key/Value store. NoFTL-KV eliminates backwards compatibility, allowing the Key/Value store to directly consume the characteristics of modern storage technologies. NoFTL-KV is implemented under RocksDB. The performance evaluation under LinkBench shows that NoFTL-KV improves transactional throughput by 33%, while response times improve up to 2.3x. Furthermore, NoFTL-KV reduces write-amplification 19x and improves storage longevity by approximately the same factor.

### Multi-Version Indexing and modern Hardware Technologies

#### A Survey of present Indexing Approaches

C. Riegger, T. Vincon, I. Petrov.
In Proc. iiWAS 2017

02.10.2017 Paper Accepted at iiWAS 2017

C. Riegger, T. Vincon, I. Petrov. Multi-Version Indexing and modern Hardware Technologies - A Survey of present Indexing Approaches. In Proc. iiWAS 2017.

[PDF]

Abstract:

Characteristics of modern computing and storage technologies fundamentally differ from those of traditional hardware. There is a need to optimally leverage their performance, endurance and energy consumption characteristics. Therefore, existing architectures and algorithms in modern high performance database management systems have to be redesigned and advanced. Multi Version Concurrency Control (MVCC) approaches in database management systems maintain multiple physically independent tuple versions. Snapshot isolation approaches enable high parallelism and concurrency in workloads at an almost serializable consistency level. Modern hardware technologies benefit from multi-version approaches. Indexing multi-version data on modern hardware is still an open research area. In this paper, we provide a survey of popular multi-version indexing approaches and an extended scope of high performance single-version approaches. An optimal multi-version index structure balances look-up efficiency for tuple versions visible to transactions against the effort of index maintenance, across different workloads on modern hardware technologies.

### Write-Optimized Indexing with Partitioned B-Trees

C. Riegger, T. Vincon, I. Petrov.
In Proc. iiWAS 2017

02.10.2017 Paper Accepted at iiWAS 2017

C. Riegger, T. Vincon, I. Petrov. Write-Optimized Indexing with Partitioned B-Trees. In Proc. iiWAS 2017.

[PDF]

Abstract:

Database management systems (DBMS) are a critical performance component in large-scale applications under modern update-intensive workloads. Additional access paths accelerate look-up performance in DBMS for frequently queried attributes, but the required maintenance slows down update performance. The ubiquitous B+-Tree is a commonly used key-indexed access path that is able to support many required functionalities with logarithmic access time to requested records. Modern processing and storage technologies and their characteristics require a reconsideration of matured indexing approaches for today's workloads. Partitioned B-Trees (PBT) leverage the characteristics of modern hardware technologies and complex memory hierarchies, as well as high update rates and changes in workloads, by maintaining partitions within one single B+-Tree. This paper includes an experimental evaluation of PBT's optimized write pattern and performance improvements. With PBT, transactional throughput under TPC-C increases by 30%; PBT results in beneficial sequential write patterns even in the presence of updates and maintenance operations.

### SIAS-Chains: Snapshot Isolation Append Storage Chains

R. Gottstein, I. Petrov, S. Hardock, A. Buchmann

27.8.2017 Paper Accepted at ADMS@VLDB 2017

R. Gottstein, I. Petrov, S. Hardock, A. Buchmann. SIAS-Chains: Snapshot Isolation Append Storage Chains. In Proc. ADMS@VLDB 2017.

[PDF]

Abstract:

Asymmetric read/write storage technologies such as Flash are becoming a dominant trend in modern database systems. They introduce hardware characteristics and properties which are fundamentally different from those of traditional storage technologies such as HDDs.

Multi-Versioning Database Management Systems (MV-DBMSs) and Log-based Storage Managers (LbSMs) are concepts that can effectively address the properties of these storage technologies but are designed for the characteristics of legacy hardware. A critical component of MV-DBMSs is the invalidation model. Transactional timestamps are assigned to the old and the new version, resulting in two independent (physical) update operations. Those entail multiple random writes as well as in-place updates, sub-optimal for new storage technologies both in terms of performance and endurance. Traditional page-append LbSM approaches alleviate random writes and immediate in-place updates, hence reducing the negative impact of Flash read/write asymmetry. Nevertheless, they entail significant mapping overhead, leading to write amplification.

In this work we present Snapshot Isolation Append Storage Chains (SIAS-Chains), an approach that combines multi-versioning with append storage management at tuple granularity and a novel singly-linked (chain-like) version organization.

SIAS-Chains features simplified buffer management, multi-version indexing and introduces read/write optimizations to data placement on modern storage media. SIAS-Chains algorithmically avoids small in-place updates, caused by in-place invalidation, and converts them into appends. Every modification operation is executed as an append and recently inserted tuple versions are co-located. SIAS-Chains is implemented in PostgreSQL and evaluated on modern Flash SSDs with a standard update-intensive workload. The performance evaluation under PostgreSQL shows: (i) higher transactional throughput - up to 30 percent; (ii) significantly lower response times - up to 7 times lower; (iii) significant write reduction - up to 97 percent reduction; (iv) reduced space consumption and (v) higher tolerable workload.
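The chain-like version organization can be illustrated with a small append-only store. This is a hedged sketch of the SIAS-Chains idea, not the PostgreSQL implementation: each logical tuple is addressed by a virtual tuple ID (VID), and every modification appends a new version that points to its predecessor, implicitly invalidating it without any in-place timestamp update. All names are illustrative.

```python
class AppendStore:
    """Sketch of SIAS-style singly-linked version chains over an append-only log."""

    def __init__(self):
        self.log = []      # append-only storage: (vid, value, prev_offset)
        self.head = {}     # VID -> log offset of the newest version

    def write(self, vid, value):
        prev = self.head.get(vid)            # predecessor in the chain, if any
        self.log.append((vid, value, prev))  # append, never overwrite
        self.head[vid] = len(self.log) - 1   # new version implicitly invalidates old

    def read(self, vid):
        off = self.head.get(vid)
        return None if off is None else self.log[off][1]

    def versions(self, vid):
        # walk the singly-linked chain from newest to oldest
        off, out = self.head.get(vid), []
        while off is not None:
            _, value, prev = self.log[off]
            out.append(value)
            off = prev
        return out
```

Note how no invalidation timestamp is ever written: updating the head pointer is what retires the old version, which is the property that turns small in-place updates into appends.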

### Paper Accepted at ICDE 2017

Selective In-Place Appends for Real: Reducing Erases on Wear-prone DBMS Storage
S. Hardock, I. Petrov, R. Gottstein, A. Buchmann.
In Proc. ICDE 2017 [PDF] [Video]

Abstract: In the present paper we demonstrate a novel technique applying the recently proposed approach of In-Place Appends (IPA): overwrites on Flash without a prior erase operation. IPA can be applied selectively: only to DB-objects that have frequent and relatively small updates. To do so, we couple IPA to the concept of NoFTL regions, allowing the DBA to place update-intensive DB-objects into special IPA-enabled regions. The decision about region configuration can be (semi-)automated by an advisor analyzing DB-log files in the background.

We showcase a Shore-MT based prototype of the above approach, operating on real Flash hardware. During the demonstration we allow users to interact with the system and gain hands-on experience under different demonstration scenarios.

### Paper Accepted at SIGMOD 2017

S. Hardock, I. Petrov, R. Gottstein, A. Buchmann.
In Proc. SIGMOD 2017 [PDF]

Abstract: Under update intensive workloads (TPC, LinkBench) small updates dominate the write behavior, e.g. 70% of all updates change less than 10 bytes across all TPC OLTP workloads. These are typically performed as in-place updates and result in random writes in page-granularity, causing major write-overhead on Flash storage, a write amplification of several hundred times and lower device longevity.

In this paper we propose an approach that transforms those small in-place updates into small update deltas that are appended to the original page. We utilize the commonly ignored fact that modern Flash memories (SLC, MLC, 3D NAND) can handle appends to already programmed physical pages by using various low-level techniques such as ISPP to avoid expensive erases and page migrations. Furthermore, we extend the traditional NSM page-layout with a delta-record area that can absorb those small updates. We propose a scheme to control the write behavior as well as the space allocation and sizing of database pages. We describe how the DBMS buffer and storage manager must be adapted to handle page operations.
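The delta-record idea described above can be sketched as follows. This is an illustration of the concept, not the Shore-MT implementation, and the `Page` class and its fields are invented for the sketch: small field updates are appended to the page's delta area until it fills, and only then is the page merged and rewritten.

```python
class Page:
    """Sketch of an NSM page extended with a delta-record area for small updates."""

    def __init__(self, records, delta_capacity=4):
        self.records = dict(records)   # base NSM record area: key -> field dict
        self.deltas = []               # delta-record area (append-only)
        self.delta_capacity = delta_capacity

    def update(self, key, field, value):
        if len(self.deltas) < self.delta_capacity:
            # small update absorbed as an append: no page rewrite, no erase
            self.deltas.append((key, field, value))
            return "append"
        self._merge()                  # delta area full: rewrite the whole page
        self.records[key][field] = value
        return "rewrite"

    def _merge(self):
        for key, field, value in self.deltas:
            self.records[key][field] = value
        self.deltas = []

    def read(self, key):
        # readers apply pending deltas on top of the base record
        rec = dict(self.records[key])
        for k, field, value in self.deltas:
            if k == key:
                rec[field] = value
        return rec
```

The sketch makes the trade-off visible: reads must replay the delta area, but a run of small updates causes only one page rewrite instead of one per update.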

The proposed approach has been implemented under Shore-MT and evaluated on real Flash hardware (OpenSSD) and a Flash emulator. Compared to In-Page Logging it performs up to 62% less reads and writes and up to 74% less erases on a range of workloads. The experimental evaluation indicates: (i) significant reduction of erase operations resulting in twice the longevity of Flash devices under update-intensive workloads; (ii) 15%-60% lower read/write I/O latencies; (iii) up to 45% higher transactional throughput; (iv) 2x to 3x reduction in overall write amplification.

### Paper Accepted at EDBT 2017

In-Place Appends for Real: DBMS Overwrites on Flash without Erase
S. Hardock, I. Petrov, R. Gottstein, A. Buchmann. In Proc. EDBT 2017 [PDF]

### 08.08.2013 DFG FlashyDB Extended

The DFG (Deutsche Forschungsgemeinschaft) has extended the research project FlashyDB. DFG FlashyDB aims to investigate the influence of Flash memory on database architecture, performance and algorithms.

### 06.08.2013 Program Committee Memberships

Members of the DBlab serve on the program committee of SIGMOD 2014 (Demo Track).

### DBlab Talk

Dr. Knut Stolze, Architect IBM DB2 Analytics Accelerator, IBM Deutschland, will give the talk 'Managing Large Data Volumes Efficiently with IBM Netezza' on 18.06.2013 at 11:30 in room 9-003.

Title:
Managing Large Data Volumes Efficiently with IBM Netezza

Who:
Dr. Knut Stolze, Architect IBM DB2 Analytics Accelerator, IBM Deutschland

When:
Tue, 18.06.2013 at 11:30 | Room 9-003

Abstract [PDF]:
Netezza is a highly specialized database management system for data warehousing operations. In this presentation, Dr. Knut Stolze gives an overview of its system architecture and internal query processing. He shows how very good performance can be delivered with a very simple user interface that avoids indexes. He then presents the IBM DB2 Analytics Accelerator, an integration project and commercial product that combines the strengths of Netezza's analytic query processing capabilities with DB2's superior OLTP performance, and highlights how the integration of both products is achieved in a (nearly) seamless way.

About the speaker:
Dr. Knut Stolze is working for the Information Management department in the IBM Research & Development Lab in Böblingen, Germany. He focuses on relational database systems, specifically large-scale data warehouse systems. He gained his expertise and experience in academic and industrial research and in product development. His current research efforts focus on enterprise data warehouse systems, in particular technologies like in-memory, specialty hardware for high performance query processing, and database federation. Knut Stolze is a senior software developer and master inventor at IBM. In his role as an architect, he is responsible for the design and implementation of the IBM DB2 Analytics Accelerator for z/OS. Prior to the current project, Dr. Stolze worked in the DB2 Spatial Extender development team, earned his PhD at the University of Jena, Germany, in 2006, and subsequently moved on to the DB2 z/OS Utilities development.

### Accepted at VLDB 2013

S. Hardock, I. Petrov, R. Gottstein, A. Buchmann. NoFTL: Database Systems on FTL-less Flash Storage. VLDB 2013 (Demonstrations Track). Riva del Garda, August 26-31, 2013. [Demonstration Video]

Abstract

The database architecture and workhorse algorithms have been designed to compensate for hard disk properties. The I/O characteristics of Flash memories have a significant impact on database systems, and many algorithms and approaches taking advantage of them have been proposed recently. Nonetheless, at the system level Flash storage devices are still treated as HDD-compatible block devices, black boxes and fast HDD replacements. This backwards compatibility (both software and hardware) masks the native behaviour, incurs significant complexity and decreases I/O performance, making it non-robust and unpredictable. Database systems have a long tradition of operating natively on RAW storage, utilising the physical characteristics of the storage media to improve performance.

In this paper we demonstrate an approach called NoFTL that goes a step further. We show that allowing for native Flash access and integrating parts of the FTL functionality into the database system yields significant performance increase and simplification of the I/O stack. We created a real-time data-driven Flash emulator and integrated it accordingly into Shore-MT. We demonstrate a performance improvement of up to 3.7x compared to Shore-MT on RAW block-device Flash storage under various TPC workloads.

### Best Paper Awards

Papers co-authored by members of the DBlab have received Best Paper Awards.

### Accepted Paper

R. Gottstein, I. Petrov, and A. Buchmann. Append storage in multi-version databases on flash. In Proc. of BNCOD 2013. Springer-Verlag, 2013.

### Service Computation Paper

El-Sheikh, E., Bagui, S., Firesmith, D.G., Petrov, I., Wilde, N., Zimmermann, A.: Towards Semantic-Supported SmartLife System Architectures for Big Data Services in the Cloud. In Proc. Service Computation '13 (2013)

### PC Memberships

iiWAS 2013 and ACM PIKM 2013

Members of the DBlab serve on the program committees of iiWAS 2013 and of PIKM 2013 at ACM CIKM 2013.

### DBlab Technical Report

G. Graefe, I. Petrov, T. Ivanov, V. Marinov. A hybrid page layout integrating PAX and NSM. Technical Report (HPL-2012-240). 2012

A technical report (HPL-2012-240) entitled 'A hybrid page layout integrating PAX and NSM' has been published as a cooperation between Hewlett-Packard Laboratories, DBlab, Reutlingen University, DVS, Technical University Darmstadt.
The report is available online at: http://www.hpl.hp.com/techreports/2012/HPL-2012-240.html

### DBKDA Papers

DBKDA Paper Accepted

Robert Gottstein, Ilia Petrov, Alejandro Buchmann. Aspects of Append-Based Database Storage Management on Flash Memories. In Proc. DBKDA 2013.

### DBlab Talk

Robert Gottstein (Databases and Distributed Systems Group, TU-Darmstadt) will give a talk on the influence of new storage technologies on database systems.

Title: Data Intensive Systems on New Storage Technologies[pdf]
When: 13.12.2012 at 13:00
Where: 9-108.

Abstract: [pdf]

As new storage technologies with radically different properties are appearing (Flash and Non-Volatile Memories), a substantial architectural redesign is required if they are to be used efficiently in a high-performance data-intensive system.


Multi-Version approaches to database systems (MVCC, SI) are gaining significant importance and becoming a dominant trend. They not only offer characteristics that meet the requirements of enterprise workloads, but also provide concepts that can effectively address the properties of new storage technologies. Yet version management may produce unnecessary random writes, which are suboptimal for the new technologies.

A variant of SI called SI-CV collocates tuple versions created by a transaction in adjacent blocks and minimizes random writes at the cost of random reads. Relative to the original algorithm, its performance in overloaded systems under heavy transactional loads in TPC-C scenarios on Flash SSD storage increases significantly. At high loads that bring the original system into overload, the transactional throughput of SI-CV increases further, while maintaining response times that are multiple factors lower. SI produces a new version of a data item once it is modified. Both the new and the old version are timestamped accordingly, which in many cases results in two independent (physical) update operations, entailing multiple random writes as well as in-place updates. These are also suboptimal for new storage technologies, both in terms of performance and endurance.

We claim that the combination of multi-versioning and append storage effectively addresses the characteristics of modern storage technologies. Snapshot Isolation Append Storage (SIAS) improves on SI and traditional "page granularity" append-based storage managers. It manages versions as singly-linked lists (chains) that are addressed by using a virtual tuple ID (VID). In SIAS the creation of a new version implicitly invalidates the old one, resulting in an out-of-place write implemented as a logical append, eliminating the need for invalidation timestamps. SIAS is coupled to an append-based storage manager, appending units of tuple versions. SIAS indicates up to 4x performance improvement on Flash SSD under a TPC-C workload, entailed by a significant write overhead reduction (up to 38x). SIAS achieves better space utilization due to denser version packing per page and allows for better I/O parallelism and up to 4x lower disk I/O execution times. SIAS improves endurance, due to the use of out-of-place writes as appends and the write overhead reduction. Compared to traditional page-granularity appends, SIAS achieves up to 85% higher read throughput and up to 38x write reduction.

### DBKDA Paper

DBKDA Paper Accepted

Christian Abele, Michael Schaidnagel, Fritz Laux, Ilia Petrov. Sales Prediction with Parametrized Time Series Analysis. In Proc. DBKDA 2013.

### Data Mining Cup 2012

27.06.2012

Michael Schaidnagel and Christian Abele, students in the Data Management Lab, earned 7th place in the overall ranking for the second assignment of the Data Mining Cup 2012.