CS25: Transformers United!

Description

Interested in Transformers, the deep learning model that has taken the world by storm? Want to have intimate discussions with researchers? If so, this course is for you! It's not every day that you get to personally hear from and chat with the authors of the papers you read!

Each week, we invite folks at the forefront of Transformers research to discuss the latest breakthroughs, from LLM architectures like GPT and Gemini to creative use cases in generating art (e.g. DALL-E and Sora), biology and neuroscience applications, robotics, playing complex games, and so forth!

CS25 has become one of Stanford's hottest and most exciting seminar courses. We invite the coolest speakers such as Andrej Karpathy, Geoffrey Hinton, Jim Fan, Ashish Vaswani, and folks from OpenAI, Google, NVIDIA, etc. Our class has been incredibly well received within and outside Stanford, with around 1 million total views on YouTube. Our class with Andrej Karpathy was the second most popular YouTube video uploaded by Stanford in 2023, with over 500k views!

We have significant improvements for Spring 2024, including a large lecture hall, professional recording and livestreaming (to the public), social events, and potential 1-on-1 networking! The only homework for students is weekly attendance at the talks/lectures. Also, livestreaming and auditing are available to the public. Feel free to audit in person or by joining the Zoom livestream. Anybody can attend; you don't have to be affiliated with Stanford!

We also have a Discord server (over 1500 members) used for Transformers discussion. We open it to the public as more of a "Transformers community". Feel free to join and chat with hundreds of others about Transformers!

Logistics

This is a 1-unit S/NC (pass/fail) course. Enroll on Axess as a Stanford student! (Waitlist available)

Lectures are on Thursdays from 4:30 to 5:50 pm PDT in the Gates Computer Science Building, Room B01 (Basement).

Zoom Livestream (Anyone can join!): Link [Meeting ID: 999 2215 1759, Password: 123456]

Announcements will be made by email, Discord, Canvas (for students), and this mailing list (for auditors/public).

Attendance: Enrolled students should attend in-person (up to 3 absences). During/following each lecture, submit a response here. Note: the form will only open during each lecture.

Auditing: Open to everyone! Please join in-person or the Zoom livestream. No need to email us. Join this mailing list for announcements.

Questions: There will be an opportunity for questions after each lecture. You can submit questions for the speakers on sli.do, using the code "cs25". Do not unmute on Zoom to ask questions. We cannot guarantee the Zoom chat will be monitored, so please ask questions on sli.do instead.

Public (Non-Stanford): There is no way to "officially" enroll in or audit this course (i.e. get a credit/certificate/acknowledgement) unless you are a Stanford student. We are just opening it up to the public for attendance.

Contact: If you have any questions about the course, contact us at cs25-spr2324-staff@lists.stanford.edu.

Recordings & Slides

We plan to publicly release YouTube recordings after each talk at a reasonable pace (i.e. approx. 2 weeks afterward).

Recordings of previous talks can be found here. Future recordings will also be posted to this same playlist. Video links will also be attached directly to the schedule below.

Slides will be posted during/after each lecture, on this website (attached to the schedule below), our Discord, and sent by email through the class mailing lists. We will aim to post them in a timely manner (i.e. within a week of each talk).

Disclaimers for Students & Attendees

In-person attendees: We will be recording, broadcasting (over Zoom), and publishing the speaker presentations to YouTube to help the timely spread of this cutting-edge information. For your convenience, you can also access these recordings by logging into the course Canvas site (students only). Video cameras located in the back of the room will capture the instructor presentations in this course. Note that while the cameras are positioned with the intention of recording only the instructor, occasionally a part of your image or voice might be incidentally captured. Before the recordings are published, an editor will review to remove any student and attendee appearances. If you have questions, please contact a member of the teaching team.

Auditors: If the room is full, please give seats to enrolled students who have priority.

Zoom attendees: Please do not unmute yourself on Zoom, use the whiteboard functionality, or engage in any other disruptive behavior! If you have any questions/concerns, please send them in the chat; we will be actively monitoring it.

Inappropriate behavior will result in being blacklisted from the course (and possibly other consequences with Stanford).

Previous Iterations

Schedule

The current class schedule is below (subject to change):

Date	Title	Description
April 4	Instructor Lecture: Overview of Transformers [In-Person] Speakers: Steven Feng, Div Garg, Emily Bunnapradist, Seonghee Lee	Brief intro and overview of the history of NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and remaining challenges/weaknesses. Also discussion about AI agents. Recording here. Slides posted here.
April 11	Intuitions on Language Models (Jason) [In-Person] Shaping the Future of AI from the History of Transformer (Hyung Won) [In-Person] Speakers: Jason Wei & Hyung Won Chung, OpenAI Jason Wei is an AI researcher based in San Francisco. He is currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in large language models such as chain-of-thought prompting, instruction tuning, and emergent phenomena. Hyung Won Chung is a research scientist on the OpenAI ChatGPT team. He has worked on various aspects of Large Language Models: pre-training, instruction fine-tuning, reinforcement learning with human feedback, reasoning, multilinguality, parallelism strategies, etc. Some of his notable work includes the scaling Flan paper (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain and before that he received a PhD from MIT.	Jason will talk about some basic intuitions on language models, inspired by manual examination of data. First, he will discuss how one can view next word prediction as massive multi-task learning. Then, he will discuss how this framing reconciles scaling laws with emergent individual tasks. Finally, he will talk about the more general implications of these learnings. Slides posted here. Hyung Won: AI is developing at such an overwhelming pace that it is hard to keep up. Instead of spending all our energy catching up with the latest development, I argue that we should study the change itself. The first step is to identify and understand the driving force behind the change. For AI, it is the exponentially cheaper compute and associated scaling. I will provide a highly-opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute. This analysis will help us connect the past and present in a unified perspective, which in turn makes it more manageable to project where the field is heading. Slides posted here.
April 18	Aligning Open Language Models [Virtual/Zoom] Speaker: Nathan Lambert, Allen Institute for AI (AI2) Nathan Lambert is a Research Scientist at the Allen Institute for AI focusing on RLHF and the author of Interconnects.ai. Previously, he helped build an RLHF research team at HuggingFace. He received his PhD from the University of California, Berkeley working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and Roberto Calandra at Meta AI Research.	Since the emergence of ChatGPT there has been an explosion of methods and models attempting to make open language models easier to use. This talk retells the major chapters in the evolution of open chat, instruct, and aligned models, covering the most important techniques, datasets, and models. Alpaca, QLoRA, DPO, PPO, and everything in between will be covered. The talk will conclude with predictions and expectations for the future of aligning open language models. Slides posted here. All the models in the figures are in this HuggingFace collection.
April 25	Demystifying Mixtral of Experts [Virtual/Zoom] Speaker: Albert Jiang, Mistral AI / University of Cambridge Albert Jiang is an AI scientist at Mistral AI, and a final-year PhD student at the computer science department of Cambridge University. He works on language model pretraining and reasoning at Mistral AI, and language models for mathematics at Cambridge.	In this talk I will introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. I will go into the architectural details and analyse the expert routing decisions made by the model.
May 2	Developing precision language models from self-attentive feed-forward units, and applying them in edge computing scenarios as untrained language models prompted to predict symbolic switches (U-LaMPS) [In-Person] Speaker: Jake Williams, Drexel University Jake Ryland Williams is an Associate Professor of Information Science at Drexel University's College of Computing and Informatics in Philadelphia, Pennsylvania. Dr. Williams has a background in physics and math with degrees from the University of Vermont, and his research leverages a quantitative linguistics perspective that applies math and statistical methodology to analyze and improve linguistic learning systems, alongside others that utilize shared neural methodology. Following a one-year Postdoctoral appointment at the University of California, Berkeley (Cal) studying large-scale machine learning in 2015, Dr. Williams became a data science (DS) faculty member at Drexel, where he drove the foundation of a DS MS program and develops and instructs DS coursework, including on natural language processing with deep learning.	The talk will discuss various effectiveness-enhancing and cost-cutting augmentations to the language model (LM) learning process, including the derivation and application of non-random parameter initializations for specialized self-attention-based architectures. These are referred to as precision LMs (PLMs), in part, for their capability to effectively and efficiently train both large and small LMs. Highlighting their hallmark capability for training with only very limited resources, an introduction to PLMs will be followed by presentation of a developing application that localizes untrained PLMs on microprocessors to act as hardware-based controllers for small electronics devices. This will discuss their utility at training in air-gapped environments, training progressively bigger models on CPUs, as well as provide detail on a fully developed control system and its user interface, including recent experiments on Le Potato, where effective inference of user directives occurred after only 20 minutes of lay interaction over a microphone and light switch.
May 9	TBD [Virtual/Zoom] Speaker: Ming Ding, Zhipu AI
May 16	TBD [In-Person] Speaker: Edward Hu, Prev. OpenAI	A new training objective for LLMs. Recommended Reading: Amortizing Intractable Inference in Large Language Models
May 23	TBD Speaker: Loubna Ben Allal, Hugging Face	Code LLMs (e.g. StarCoder).
May 30	TBD Speaker: TBD

CS25: Transformers United V4

Spring 2024

Apr. 4 - May 30