CS 122C/222: Principles of Data Management

Syllabus

CS122C/CS222: Principles in Data Management

TuTh, 9:30-10:50, Steinhaus Hall 128

Course Personnel

Staff Office Hours Zoom Email
Instructor: ​Prof. Chen Li

Tuesdays, 11 am - noon

Office hour during the finals week: 3/19, Tuesday, 10 - 11 am, DBH 2086, in-person only

https://uci.zoom.us/j/94788585384 Links to an external site. and DBH 2086 chenli@ics.uci.edu
Assistant: Yunyan Ding

Mondays, 4 p.m - 5 p.m

https://uci.zoom.us/j/2991875437 Links to an external site. ICS 458C

yunyad1@uci.edu

Course Description

This course provides an implementor's view of database management systems. It covers fundamental principles and implementation techniques, issues, and trade-offs related to database management systems. Topics covered include storage management, buffer management, record-oriented file systems, access methods, query processing, and query optimization.  A significant portion of database systems research as well as industrial database and information system development deals with adapting the basic database techniques covered in this course to new advances in hardware and software technologies or to the requirements of different applications and data types.

Class attendance

In-person attendance is required. We will use the lectures to cover many important topics, including discussions about the projects (designs and code reviews).  Missing a lecture will be significant loss for students.

Projects

The project is structured as a four-part, software development exercise:

Project Due Date Allocated Days Topic Proportions (60%)
Project 1 Sunday, Week 2
Jan 20
12

Getting Started, Paged File Manager (PFM) & Record-Based File Manager (RBFM)

14%
Project 2 Sunday, Week 5 
Feb 10
21

Relation Manager (RM)

17%
Project 3 Sunday, Week 8
Mar 3
21

Index Manager (IX)

17%
Project 4 Sunday, Week 10
Mar 17
14

Query Engine (QE)

12%

Please refer to the Assignments link for more detailed instructions.


Lectures

Date Topic Readings Notes
01/09/24 (Tu) Course Overview, DBMS Overview, Storage, P1 Ch. 1; Chamberlin et al 1981 01 Download 01
01/11/24 (Th) P1: PagedFileManager and RecordBasedFileManager,   Ch. 9.1, 9.2, 9.6, 9.7 02 Download 02
01/16/24 (Tu) Disks, Record/page formats
01/18/24 (Th)

deleteRecord(), updateRecord(), PAX, Column/Row stores, OLTP/OLAP

Ch. 9.3, 9.4, 9.5 03 Download 03
01/23/24 (Tu) Parquet, Heap Files, Buffer Manager, code reviews Ch. 12.1
01/25/24 (Th) P2, Catalog Ch. 8.4 04 Download 04
01/30/24 (Tu) Schema versioning, File Organizations, Index Overview Ch. 8.4, Ch. 10 05 Download 05
02/01/24 (Th) Clustered index, B+ tree Ch. 10.3-10.6, Graefe 2011 (Sec. 1-3) 06 Download 06, 07 Download 07
02/06/24 (Tu) B+ tree Ditto Ditto
02/08/24 (Th) Hash tables, index comparison Ch. 11, Litwin 1980 08 Download 08
02/13/24 (Tu) In-class midterm  
02/15/24 (Th) P3 overview; Index performance analysis, Composite index Ch. 8.4 - 8.6 09, Download 09,10 Download 10
02/20/24 (Tu) Index-only plans, External Sort Ch. 13, Ch. 14.1 - 14.2 11 Download 11
02/22/24 (Th) Selection Operator, Projection, Duplicate Elimination, Join (Part 1) Ch. 14.5, 14.6 12 (s19) Download 12 (s19)
02/27/24 (Tu) Join (Part 2) Ch. 14.5, 14.6 13 (S20) Download 13 (S20)
02/29/24 (Th) Aggregation, set operators,  Ch. 15.2 - 15.4 14 (S10) Download 14 (S10)
03/05/24 (Tu) P4 overview, optimization overview   Ditto (S14)
03/07/24 (Th) Cost Estimation, Histograms  Ch. 15.6 15 (S17) Download 15 (S17)
03/12/24 (Tu) System-R optimizer, Volcano optimizer  Selinger et al 1979 16 Download 16
03/14/24 (Th) Parser, Other Indexes, Other Data Systems, Storage-compute decoupling (e.g., Snowflake, Databricks)   17 Download 17

 


Working on projects as teams

For CS122C undergraduates

Students form a team of up to two (2) members to do the projects. 

1) Both members should access the same github repository, and each team only needs one submission.

2) A student can always leave his/her team during the quarter with a notice at least two weeks prior to the next project's deadline. Note that it is not permitted to join a new team after splitting. After a split, each member should have the access to the codebase before the split. 

3) For each project done by a student individually, you will get a 10% extra credit.

4) Please sign up and include both team members in the team signup spreadsheet Links to an external site. to form the team.

For CS222 graduate students

Students do the projects individually. 


Online Discussion

We are using  Links to an external site.Ed Discussion Links to an external site. as the forum for course discussion. Students are required to check all the posts frequently for important discussions and announcements.

Rather than emailing questions to any of us on the teaching staff, we'll ask you to post your questions on Ed Discussion. 

  • Please use the forum properly. It's a place for students to exchange ideas. Don't post "easy" or "random" questions without much thinking.
  • To encourage students to actively participate in the forum discussions and provide high-quality answers, we will select two students with the best discussion performance. Each student will get 2% extra credits in the overall scores. We reserve the right not to choose any students to receive the credits.
  • Please make this online discussion (at least monitoring the activity there) a part of your regular routine!

Prerequisites

Please don't attempt this class without the required background! Doing so has not proven to be a great idea for students historically. :-) You should have the following prerequisites: 

    1. CS122A: Introduction to Data Management

    2. And one of the two courses below:

Github and GradeScope

  • You are required to use Github to manage your source code. 
  • We are also using GradeScope for grading. Learn more about auto-grading with GradeScope AutoGrader here. Links to an external site.
  • Each project will have a corresponding GradeScope assignment that will be released according to our assignment schedule. You can either submit by uploading your source code or by pulling the source code from your GitHub repo, and can submit arbitrary number of times before the deadline. After submission the AutoGrader will run automatically and after it finishes running you should be able to see if you can pass each test case immediately.
  • Students are allowed to submit after the deadline of the project. But if you submit on GradeScope during Grace Period, we consider you using the Grace Period. So if you don't want to use Grace Period, please do not submit assignments on GradeScope during the 48 hours after the deadline.

Grading:

  • Midterm Exam: 20%, 02/13/23 (Tuesday), in-class, in-person.
  • Final Exam: 20%, 3/21/2024, Thursday,  8:00 am - 10:00 am, in-person (according to the University schedule).
  • Programming Projects: 60%

This mixture of grading criteria for this course is intended to give equal "excelling opportunities" to both thinkers and ``doers.'' This is a hands-on project class, so expect to put a significant effort into your projects. The exam will be comprehensive with respect to the course material and will also ping you with respect to project knowledge. 

For all the graded projects and exam, if you disagree with the grading, you can discuss with us within one week after they are returned. After that, all the grades will be finalized. For the final exam, the disputation deadline is subject to change upon University's grade submission timeline.


Textbooks

Required: Database Management Systems, 3rd edition, by R. Ramakrishnan and J. Gehrke, McGraw Hill, 2003.
Recommended: Readings in Database Systems, 4th edition, by J. Hellerstein and M. Stonebraker, MIT Press, 2005.


Other Readings

The following papers are mostly drawn from the Hellerstein and Stonebraker book or can be found via UCI's ACM Digital Library subscription (http://dl.acm.org Links to an external site.). You will be responsible both for the material covered in the lectures and the material covered in the readings listed in the Syllabus below.

  • Abadi, D., Madden, S., and Hachem, N. 2008. Column-stores vs. Row-stores: How Different are They Really?. Proc. ACM SIGMOD Intl. Conf. on Management of Data (Vancouver, Canada, June 2008).
  • Alsubaiee, S., et a., 2014. AsterixDB: a scalable, open source BDMS. Proc. VLDB Endow. 7, 14 (Oct. 2014), 1905-1916.
  • Chamberlin, D., et al, 1981. A history and evaluation of System R. Commun. ACM 24, 10 (Oct. 1981), 632-646.
  • Dean, J., and Ghemawat, S., 2004. MapReduce: Simplified data processing on large clusters. Proc. 6th USENIX Symp. on Op. Sys. Design and Impl. (OSDI'04) (San Francisco, CA, December 2004).
  • DeWitt, D. and Gray, J. 1992. Parallel database systems: The future of high performance database systems. Commun. ACM 35, 6 (June 1992), 85-98.
  • Graefe, G., 2011. Modern B-Tree Techniques. Foundations and Trends in Databases 3, 4 (Sept. 2011) 203-402. (Available from the UCI intranet at: http://dx.doi.org/10.1561/1900000028 Links to an external site.)
  • Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. Proc. ACM SIGMOD Intl. Conf. on Management of Data (Boston, Massachusetts, June 1984).
  • Litwin, W., 1980. Linear hashing: A new tool for file and table addressing. Proc. 6th Int'l. Conf. on Very Large Data Bases, Montreal, Oct. 1980.
  • Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The Grid File: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1 (March 1984), 38-71. (optional)
  • Selinger, P. G. et al, 1979. Access path selection in a relational database management system. Proc. ACM SIGMOD Int'l. Conf. on Management of Data (Boston, Massachusetts, May-June 1979).
  • Shapiro, L. D. 1986. Join processing in database systems with large main memories. ACM Trans. Database Syst. 11, 3 (Aug. 1986), 239-264.
  • Stonebraker, M. 1981. Operating system support for database management. Commun. ACM 24, 7 (July 1981), 412-418.

Programming Project

This course is intended to teach the principles involved in DBMS implementation, so it will include a significant programming component.

  • The project will be aimed at giving you a sense of what goes on inside a DBMS, and what the issues and challenges are in building a system "now" that will be managing data of various user-defined shapes and sizes "later". You will try your hand at record-based file storage, indexing, and relational runtime system programming (a.k.a. query operators).
  • The language for the project will be C++ - hopefully, you either have C++ experience or you have Java experience and can learn C++ quickly. Our assistant(s) will run the project, and they can provide you with additional suggestions for material to help you "spin up" on C++ programming, debugging, and tools in case you need to do a bit of remedial work on the side in order to tackle the project successfully. The course has the same requirements for both graduate students and undergraduate students.

Notice about Projects

  • Students are allowed to add attributes, methods, and classes, as long as the public signatures provided in the codebase are kept and implemented. We will be testing your code by calling the methods that are declared in the signatures. See README.md in the project for more details.
  • We have provided Gtest cases for testing and grading purposes. The test cases we used for grading might be different than the ones provided to you.

Project Late Policy

  • The official due date for each assignment is listed above, and it is expected that students will turn most work in on or before that date.
  • Grace Period
    • We will offer a 48-hour "grace period" for each assignment, and will therefore accept solution submissions that are turned in within 48 hours of the due date (i.e., less than 2 days late). No need to ask - this time is yours in the event that you need it for some unforeseen reason.
    • If you use it, your score will be deducted by 10 points. You are recommended NOT to use the grace period, as you will lose not only 10 points, but also valuable time from the next assignment's working period for every hour that you run late.
  • Late assignments after the grace period will NOT be accepted, so do always aim to be on time! (Please don't even ask, as this is what the 48-hour grace period is intended for. :-))

C/C++ Resources


Cheating Policy

  • Cheating is an area where the instructor for this course has absolutely no patience or sympathy whatsoever. You are all here to learn, and cheating defeats that purpose and is unfair to your fellow students. Students will be expected to adhere to the UCI and ICS Academic Honesty policies (and should see http://www.editor.uci.edu/catalogue/appx/appx.2.htm#academic and http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty to read their details).
  • We run a software package to search for cheating cases in projects. Any student found to be cheating or aiding others in cheating will be academically prosecuted to the maximum extent possible — so you may fail the course with no option for a second chance. This policy will be strictly adhered to; no exceptions will be possible regardless of the impact it may have on your studies, your work plans, your visa status, or anything else. Don’t cheat and you won’t have issues. Just say no to cheating!

Comprehensive Exam Option

Comprehensive Exam Signup Sheet Links to an external site.

  • Computer Science graduate students wishing to satisfy the MS Comprehensive Exam requirement in Database Systems via this course should notify the instructor and the Student Advising Office (SAO) of this intention via e-mail before the day of the final exam (alternatively, you can directly add your name to the sheet below).
  • For such students, a weighted final exam performance will be used to determine your MS Comprehensive Exam grade (P or F) at the end of the course, and this grade will be reported to SAO.
  • Students who do not pass successfully can try the exam again the next time this course is offered by taking the final exam (and midterm exam, if offered) in that offering of the course; hopefully, this will not be necessary for anyone. (Students who do this are advised to informally audit the next offering of the class; this is because the course emphasis and expected knowledge can vary a bit by term and by the instructor, and also because you'll probably need the review to pass.