Graduate Database Systems (Spring 2020)

Course Description

The Database Systems field has been exploring issues in data storage, management, processing and analysis for over 50 years. In our data-rich 21st century, the lessons of the field apply broadly across Computer Science and beyond.

In this course we will study a range of papers from the distant and recent past. The goal is to provide a strong foundation for applied computer scientists of many stripes, whether or not they plan to use the techniques in the context of traditional database systems. Given the breadth and history of the field, it is not possible to cover it comprehensively. Our focus will largely be on core database systems topics, and while the emphasis will be on systems work we will not shy away from theory. The readings will mix classic papers with emerging results, and revisit topics that may be ripe for revival given current technology trends.

Course Format

The course will have two phases. For the first few weeks, we will read papers from some of the foundational systems in the field. Classes will be in lecture format for this phase, with class discussion. This will give an overview of recurring themes in the field at a relatively high level, and provide a common basis for subsequent papers.

In the second phase, we will follow a weekly pattern used in other Berkeley graduate courses in recent years. The Thursday lecture will be presented by Prof. Hellerstein and will cover the context of the topic as well as a high-level overview of the reading for the week. The Tuesday lecture will be organized around a mini program committee meeting for the weeks readings. Students will be required to submit detailed reviews for a subset of the papers and lead the paper review discussions. The goal of this format is to both build a mastery of the material and develop your ability to evaluate and review research–including your own.

As a template, we will use these “program committee” template slides from Joseph Gonzalez.

Syllabus

This is a tentative schedule. Specific readings are subject to change as new material is published.

Jump to Today

</tbody> </table> More tentative syllabus [here](https://docs.google.com/spreadsheets/d/1P1FoyTejwmcYAAmiNP9TkgEGltQkD2lCUYEIf1qbcNs/edit?usp=sharing). ## Projects Detailed candidate project descriptions will be posted shortly. However, students are encourage to find projects that relate to their ongoing research. ## Grading Grades will be largely based on class participation and projects. In addition, we will require weekly paper summaries submitted before class. * **Projects:** _60%_ * **Weekly Summaries:** _20%_ * **Class Participation:** _20%_
Week Date (Lec.) Topic
1
1/21/20
( 1 )

Introduction and Course Overview

This lecture will be an overview of the class and requirements. We will go over the history of the field and review some core architectural questions that need to be addressed in any data serving system.

2
1/23/20
( 2 )

Pioneering Systems Group #1: IBM

After the landmark success of System R in the 1970s (which you hopefully studied in CS262A), IBM Almaden’s database group was among the first to tackle the challenge of Distributed Databases in the 1980s. We’ll read two detailed papers from the project, one on transaction processing (specifically two-phase commit) and one on query optimization.

  • Extensions to Starburst: Objects, Types, Functions and Rules Following R*, the Almaden team shifted its focus in the late ’80s and ’90s to database system extensibility in the Starburst project. Many of the ideas in Starburst were also explored in the Postgres and Exodus projects we’ll be reading about shortly. Starburst had significant impact on IBM’s DB2 database through the 1990s and 2000s.
1/28/20
( 3 )

Pioneering Systems Group #2: Berkeley

Berkeley’s INGRES project was the competitor to System R in the 1970s. Postgres was the follow-on project in the 1980’s and ’90s, focusing on building an extensible database system that could handle new data types, richer queries and even ad hoc computation (UDFs) “close to the data”. The programmability of Postgres UDFs presaged the big data movement of the 2000’s. Postgres was a classic “second system”, rich in ideas, not all of which panned out. But Postgres’ extensible architecture arguably saved it from the second system effect, and its architecture today is quite similar to how it looked in the early 1990s.

3
1/30/20
( 4 )

Pioneering Systems Group #3: Wisconsin

Historically the database systems field was relatively small in academia. One of the largest and most influential academic groups was at the University of Wisconsin. In addition to a range of theory work, Wisconsin in the 1980’s and 1990’s pursued themes in both Parallel Databases in the Gamma project (presaging the data warehouse and big data movements) and Object-Oriented Databases in the Exodus and SHORE projects (the “losing side” in the database extensibility wars. Or was it?)

2/4/20
( 5 )

Foundations of Concurrency Control: Bernstein and Goodman

Bernstein and Goodman’s 1981 Computing Surveys paper covers nearly every trick imaginable in concurrency control. Everything since is very likely a combination of techniques in this paper.

4
2/6/20
( 6 )

Concurrency Control Performance

Our first paper is a classic example of good performance work, which cleaned up controversies in prior work by questioning their assumptions and building the right simulation.

Our second paper is a seminal but largely overlooked work by Kung and Papadimitriou defining what “optimality” might mean for concurrency control performance. It embraces au courant ideas like exploiting program semantics and integrity constraints into the definition and looks at how much they can help in theory.

2/11/20
( 7 )

Indexing

Our first paper is a classic reference on B-tree concurrency, a surprisingly tricky topic. Later in the semester we will likely see more recent adaptations of this work for main-memory database. The next two papers represent classic and widely-used indexing techniques, both of which are due to the late Pat O’Neil, who passed away this past year. His wife Betty was a frequent co-author.

Optional reading for those who are curious about concurrency and recovery for extensible DBMSs.

5
2/13/20
( 8 )

Introduction to Weak Isolation and Replica Consistency

I will be out at Microsoft on a RISE faculty expedition. Chenggang Wu will present the lecture.

2/18/20
( 9 )

Weak Isolation

Three papers. A classic, co-authored by Atul Adya and occasional database research contributor and Turing winner Barbara Liskov, a brand new look at the problem co-authored by incoming Berkeley faculty Natacha Crooks, and yet another appearance from Pat and Betty O’Neil discussing a very common weak isolation mode: Snapshot Isolation.

Optional reading from then-Berkeley PhD student Peter Bailis, which identifies and pushes the boundary of what’s possible with coordination-free isolation levels.

6
2/20/20
( 10 )

Introduction to Application-Level Consistency

To introduce this lecture, I’ll start with two conjectures in distributed computing: Eric Brewer’s CAP Conjecture, and my own CALM Conjecture. Each of these was subsequently formalized and proven. I’ll talk about the two conjectures, the two theorems, and the distance between conjecture and theorem in both cases. We will also discuss the relationship between all four.

2/25/20
( 11 )

Application-Level Consistency

In Kung & Papadimitriou, we looked in the (very) abstract at the idea that higher levels of concurrency could be allowed with more application information. This week we look at concrete examples of this idea including an influential design pattern paper from industry (Helland/Campbell) and papers from two of the leading academic efforts of the last decade: Berkeley’s Bloom language and INRIA’s CRDT efforts (which you may have read about in CS262A.)

Additional background reading is offered for each of the papers we’re reading.

7
2/27/20
( 12 )

Introduction to Learning and Database Systems

3/3/20
( 13 )

Learning-Based Query Optimization

Class-led discussion; Prof H will be away.

8
3/5/20
( 14 )

Query Processing Revisited

3/10/20
( 15 )

New Approaches to Join Processing

9
3/12/20
( 16 )

Introduction to Approximate Query Processing

As background, you might want to skim the light optional reading for this day.

3/17/20
( 17 )

Approximate Query Processing

10
3/19/20
( 18 )

Introduction to Main-Memory Query Processing

Lecture Cancelled (Faculty Retreat!)

3/24/20
( 19 )

Spring Break

11
3/26/20
( 20 )

Spring Break

3/31/20
( 21 )

Main Memory Query Processing

12
4/2/20
( 22 )

Introduction to Main-Memory Concurrency Control

4/7/20
( 23 )

Main Memory Concurrency Control

13
4/9/20
( 24 )

Introduction to Programming Support for DB-Backed Applications

Guest lecturer: Alvin Cheung, UC Berkeley

4/14/20
( 25 )

Programming Support for Database-Backed Applications

14
4/16/20
( 26 )

Introduction to Cloud Databases

Guest lecturer: Alex Lloyd (Google)

4/21/20
( 27 )

Cloud Databases

15
4/23/20
( 28 )

Data Wrangling and Cleaning

4/30/20
( 29 )

Paper Discussion TBA

16
5/5/20
( 30 )

Stream Query Processing

5/7/20
( 31 )

Paper Discussion TBA

  • TBA
  • TBA
  • TBA

</div>