COMPSCI 256: SYSTEMS AND ML (34690)

COMPSCI 256: Systems and Machine Learning

Welcome to the graduate course on Systems and Machine Learning! This research-oriented course covers two themes: Systems for Machine Learning and Machine Learning for Systems.

 

Machine learning is transforming domains ranging from natural language processing to drug discovery. A key enabler of the rapid recent progress in ML/AI has been the fast evolution of the underlying hardware and software platforms. In this course, we will cover recent advances in research and industry on the machine learning systems that enabled the AI/ML revolution. Specific topics include domain-specific architectures, deep learning frameworks and compilers, and networking and scheduling in deep learning clusters. We will also discuss practical challenges in deploying such systems. In the second half of the course, we will explore how machine learning has been applied to networking and systems challenges such as Internet congestion control, adaptive bitrate selection in video streaming, and flow prediction.

 

Instructor: Sangeetha Abdu Jyothi

Class Hours: MW 5:00-6:20 PM

Location: PCB 1200

Office Hours: MW 6:20-6:50 PM (in person, after class) and Tue 2:30-3:30 PM on Zoom (https://uci.zoom.us/j/97986993496?pwd=M3FYT2o3a1JGZ2FxUUNKUUVEL3NUQT09), or by appointment

TA: Kapil Agrawal

TA Office Hours: Tue 4:00-5:00 PM, ICS 415

Course Policies: Course Policies

Prerequisites: Understanding of basic concepts in machine learning and systems (at least one undergraduate course in ML, plus at least one in networking, operating systems, or distributed systems)

 

Schedule: (Papers may be updated as the quarter progresses, so please check back periodically. The paper for each class will be finalized one week before that class.)

Date Category Lecture Required Reading Optional
10/02/23 Lec 1: Introduction
10/04/23 Systems For ML Lec 2: Domain Specific Architectures
    In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA'17)
    A Configurable Cloud-Scale DNN Processor for Real-Time AI (ISCA'18)
10/09/23 Systems For ML Lec 3: DL Frameworks
    PyTorch (NeurIPS'19)
    TensorFlow (OSDI'16)
    MXNet
    CNTK (KDD'16)
10/11/23 Systems For ML Lec 4: DL compilers
    TVM (OSDI'18)
    Glow
    nGraph
    XLA
    TensorComprehensions
10/16/23 Systems For ML Lec 5: Networking Challenges in DNN training
    Parameter Server (OSDI'14)
    Horovod (arXiv'18)
    BytePS (OSDI'20)
    TicTac (MLSys'19)
    P3 (MLSys'19)
10/18/23 Systems For ML Lec 6: Cluster scheduling for DL workloads
    Gavel (OSDI'20)
    AntMan (OSDI'20)
    Themis (NSDI'20)
    Gandiva (OSDI'18)
10/23/23 Systems For ML Lec 7: Automated ML: Hyperparameter Tuning and Neural Architecture Search (NAS)
    Hyperband (JMLR'18)
    Cerebro (VLDB'20)
    Vizier (KDD'17)
    ASHA (MLSys'20)
    Designing Neural Architectures using RL (ICLR'17)
10/25/23 Systems For ML Lec 8: Scaling RL
    Asynchronous Methods for Deep Reinforcement Learning (ICML'16)
    RLlib (ICML'18)
    XGBoost (KDD'16)
10/30/23 Systems For ML Lec 9: ML at the Edge
    Towards Federated Learning at Scale: System Design (MLSys'19)
11/01/23 Systems For ML Lec 10: LLM training
    TeraPipe (ICML'21)
    LightSeq2 (SC'22)
    PipeFisher (MLSys'23)
11/06/23 Systems For ML Lec 11: LLM serving
    Orca (OSDI'22)
    Scaling Transformer Inference (MLSys'23)
11/08/23 Systems For ML Lec 12: Challenges in Operational ML Systems
    Hidden Technical Debt in Machine Learning Systems (NeurIPS'15)
    TFX (KDD'17)
    Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective (HPCA'18)
11/13/23 ML for Systems Lec 13: Deep RL and Challenges in the real world
    Deep RL tutorial (background reading, not for review)
    Challenges of Real-World Reinforcement Learning (ICML'19)
11/15/23 ML for Systems Lec 14: Classic control + Learning (congestion control & ABR)
    Classic Meets Modern (SIGCOMM'20)
    Aurora (ICML'19)
    Pensieve (SIGCOMM'17)
    Learning in situ (NSDI'20)
11/20/23 ML for Systems Lec 15: ML for Index Structures
    Learned Index Structures (SIGMOD'18)
11/22/23 ML for Systems Lec 16: ML for Flow Prediction
    Flux (NSDI'19)
11/27/23 ML for Systems Lec 17: Interpretability of RL-based controllers
    Metis (SIGCOMM'20)
11/29/23 Project Presentations
12/04/23 Project Presentations
12/06/23 Project Presentations

 

Grading

Paper Summaries: 40%

Project: 60%, broken down as follows:

    Title and plan: 5%

    Checkpoint 1 (Report + Recorded presentation): 15%

    Checkpoint 2 (Report): 10%

    Presentation: 15%

    Final report: 15%