Description

Interested in Transformers, the deep learning model that has taken the world by storm? Want to have intimate discussions with researchers? If so, this course is for you! It's not every day that you get to personally hear from and chat with the authors of the papers you read!

Each week, we invite folks at the forefront of Transformers research to discuss the latest breakthroughs, from LLM architectures like GPT and Gemini to creative use cases in generating art (e.g. DALL-E and Sora), biology and neuroscience applications, robotics, playing complex games, and so forth!

CS25 has become one of Stanford's hottest and most exciting seminar courses. We invite the coolest speakers such as Andrej Karpathy, Geoffrey Hinton, Jim Fan, Ashish Vaswani, and folks from OpenAI, Google, NVIDIA, etc. Our class has been received enthusiastically both within and outside Stanford, with around 1 million total views on YouTube. Our class with Andrej Karpathy was the second most popular YouTube video uploaded by Stanford in 2023, with over 500k views!

We have significant improvements for Spring 2024, including a large lecture hall, professional recording and livestreaming (to the public), social events, and potential 1-on-1 networking! The only homework for students is weekly attendance at the talks/lectures. Livestreaming and auditing are also available to the public: feel free to audit in person or by joining the Zoom livestream. Anybody can attend; you don't have to be affiliated with Stanford!

We also have a Discord server (over 1500 members) used for Transformers discussion. We open it to the public as more of a "Transformers community". Feel free to join and chat with hundreds of others about Transformers!

Logistics

  • This is a 1-unit S/NC (pass/fail) course. Enroll on Axess as a Stanford student! (Waitlist available)
  • Lectures are on Thursdays at 4:30 - 5:50 pm PDT, Gates Computer Science Building, Room B01 (Basement)
  • Zoom Livestream (Anyone can join!): Link [Meeting ID: 999 2215 1759, Password: 123456]
  • Announcements will be made by email, Discord, Canvas (for students), and this mailing list (for auditors/public).
  • Attendance: Enrolled students should attend in-person (up to 3 absences). During/following each lecture, submit a response here. Note: the form will only open during each lecture.
  • Auditing: Open to everyone! Please join in-person or the Zoom livestream. No need to email us. Join this mailing list for announcements.
  • Questions: There will be an opportunity for questions after each lecture. You can submit questions for the speakers on sli.do, using the code "cs25". Do not unmute on Zoom to ask questions. We cannot guarantee the Zoom chat will be monitored, so please ask questions on sli.do instead.
  • Public (Non-Stanford): There is no way to "officially" enroll in or audit this course (i.e. get a credit/certificate/acknowledgement) unless you are a Stanford student. We are just opening it up to the public for attendance.
  • Contact: If you have any questions about the course, contact us at cs25-spr2324-staff@lists.stanford.edu.

Recordings & Slides

  • We plan to publicly release YouTube recordings after each talk at a reasonable pace (i.e. approx. 2 weeks afterward).
  • Recordings of previous talks can be found here. Future recordings will also be posted to this same playlist.
  • Slides will be posted during/after each lecture, on this website (attached to the schedule below), our Discord, and sent by email through the class mailing lists. We will aim to post them in a timely manner (i.e. within a week of each talk).

Disclaimers for Students & Attendees

  • In-person attendees: We will be recording, broadcasting (over Zoom), and publishing the speaker presentations to YouTube to help the timely spread of this cutting-edge information. For your convenience, you can also access these recordings by logging into the course Canvas site (students only). Video cameras located in the back of the room will capture the instructor presentations in this course. Note that while the cameras are positioned with the intention of recording only the instructor, occasionally a part of your image or voice might be incidentally captured. Before the recordings are published, an editor will review to remove any student and attendee appearances. If you have questions, please contact a member of the teaching team.
  • Auditors: If the room is full, please give up your seat to enrolled students, who have priority.
  • Zoom attendees: Please do not unmute yourself on Zoom, use the whiteboard functionality, or engage in any other disruptive behavior! If you have any questions/concerns, please send them in the chat; we will be actively monitoring it.
  • Inappropriate behavior will result in being blacklisted from the course (and possibly other consequences with Stanford).

Faculty Advisor

    Schedule

    The current class schedule is below (subject to change):


    Date Title Description
    April 4 Instructor Lecture: Overview of Transformers [In-Person]

    Speakers: Steven Feng, Div Garg, Emily Bunnapradist, Seonghee Lee
    Brief intro and overview of the history of NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and remaining challenges/weaknesses. Also discussion about AI agents. Slides posted here.
    April 11 Intuitions on Language Models (Jason) [In-Person]

    Shaping the Future of AI from the History of Transformer (Hyung Won) [In-Person]

    Speakers: Jason Wei & Hyung Won Chung, OpenAI

    Jason Wei is an AI researcher based in San Francisco. He is currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in large language models such as chain-of-thought prompting, instruction tuning, and emergent phenomena.

    Hyung Won Chung is a research scientist on the OpenAI ChatGPT team. He has worked on various aspects of Large Language Models: pre-training, instruction fine-tuning, reinforcement learning with human feedback, reasoning, multilinguality, parallelism strategies, etc. Some of his notable work includes the scaling Flan paper (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and before that he received a PhD from MIT.
    Jason will talk about some basic intuitions on language models, inspired by manual examination of data. First, he will discuss how one can view next word prediction as massive multi-task learning. Then, he will discuss how this framing reconciles scaling laws with emergent individual tasks. Finally, he will talk about the more general implications of these learnings. Slides posted here.

    Hyung Won: AI is developing at such an overwhelming pace that it is hard to keep up. Instead of spending all our energy catching up with the latest development, I argue that we should study the change itself. The first step is to identify and understand the driving force behind the change. For AI, it is exponentially cheaper compute and the associated scaling. I will provide a highly opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute. This analysis will help us connect the past and present in a unified perspective, which in turn makes it more manageable to project where the field is heading. Slides posted here.
    April 18 Aligning Open Language Models [Virtual/Zoom]
    Speaker: Nathan Lambert, Allen Institute for AI (AI2)

    Nathan Lambert is a Research Scientist at the Allen Institute for AI focusing on RLHF and the author of Interconnects.ai. Previously, he helped build an RLHF research team at HuggingFace. He received his PhD from the University of California, Berkeley working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and Roberto Calandra at Meta AI Research.
    Since the emergence of ChatGPT there has been an explosion of methods and models attempting to make open language models easier to use. This talk retells the major chapters in the evolution of open chat, instruct, and aligned models, covering the most important techniques, datasets, and models. Alpaca, QLoRA, DPO, PPO, and everything in between will be covered. The talk will conclude with predictions and expectations for the future of aligning open language models. Slides posted here. All the models in the figures are in this HuggingFace collection.
    April 25 Demystifying Mixtral of Experts [Virtual/Zoom]
    Speaker: Albert Jiang, Mistral AI / University of Cambridge

    Albert Jiang is an AI scientist at Mistral AI, and a final-year PhD student at the computer science department of Cambridge University. He works on language model pretraining and reasoning at Mistral AI, and language models for mathematics at Cambridge.
    In this talk I will introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. I will go into the architectural details and analyse the expert routing decisions made by the model.
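The top-2 routing described above can be illustrated with a minimal sketch. This is not Mixtral's actual implementation: the toy dimensions, the helper names (`matvec`, `expert_ffn`, `top2_moe`, `rand_matrix`), the random weights, and the plain two-layer ReLU experts are all assumptions made for illustration. It only shows the mechanism: a router scores all 8 experts, the two highest-scoring experts process the token, and their outputs are combined with weights from a softmax over the two selected scores.

```python
import math
import random

def matvec(vec, mat):
    """Multiply a row vector (length n) by a matrix stored as n rows of m columns."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def expert_ffn(x, w_in, w_out):
    """Toy expert: a two-layer ReLU feedforward block (a simplifying assumption)."""
    hidden = [max(h, 0.0) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)

def top2_moe(x, router_w, experts, k=2):
    """Route one token vector x through the k highest-scoring experts."""
    logits = matvec(x, router_w)  # one router score per expert
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax over the selected scores only, so the k gate weights sum to 1.
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    gates = [v / sum(exps) for v in exps]
    out = [0.0] * len(x)
    for gate, e in zip(gates, chosen):
        y = expert_ffn(x, *experts[e])  # only the chosen experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen, gates

def rand_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of standard normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 8  # toy sizes, far smaller than a real model
router_w = rand_matrix(rng, d_model, n_experts)
experts = [
    (rand_matrix(rng, d_model, d_hidden), rand_matrix(rng, d_hidden, d_model))
    for _ in range(n_experts)
]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]

for t, x in enumerate(tokens):
    out, chosen, gates = top2_moe(x, router_w, experts)
    print(t, chosen, [round(g, 3) for g in gates])
```

Each token evaluates only two of the eight expert blocks, which is why the active parameter count per token is much smaller than the total, while different tokens are free to select different expert pairs.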
    May 2 Developing precision language models from self-attentive feed-forward units, and applying them in edge computing scenarios as untrained language models prompted to predict symbolic switches (U-LaMPS)
    Speaker: Jake Williams, Drexel University

    Jake Ryland Williams is an Associate Professor of Information Science at Drexel University's College of Computing and Informatics in Philadelphia, Pennsylvania. Dr. Williams' undergraduate background is in Physics, and he holds an Applied Mathematics MS alongside a PhD in Mathematical Sciences (all from the University of Vermont). Dr. Williams' PhD was largely in pure mathematics, with doctoral research in quantitative linguistics that applied mathematics to the study of statistical linguistic phenomena, treating the subject as a domain of statistical physics. To conduct this research, the necessities of data processing led Dr. Williams to become a data scientist, which he followed post-graduation into a Postdoctoral appointment in the School of Information at the University of California, Berkeley (Cal). At Cal, Dr. Williams began his career in graduate data science (DS) education on techniques for large-scale machine learning, while he studied opportunities for the application of statistical theory to natural language processing (NLP). Upon becoming a DS faculty member at Drexel, Dr. Williams drove the foundation of a DS MS program, where he developed and instructed DS coursework, ultimately in the methodological subject of NLP with deep learning. Teaching NLP with deep learning brought Dr. Williams to realize an alternative pedagogical model for teaching neural network methodology that integrates theory from traditional statistical learning, which is borne out in his research and this talk.
    Dr. Williams' research develops and applies theory on what neural networks learn (statistically) as a means to improve the design and function of neural architectures and learning processes. This has recently inspired Dr. Williams to invent a range of precision technologies developed for effectively and efficiently training both large and small neural language models, which are capable of greatly reducing the costs of training and infrastructure behind, e.g., OpenAI's ChatGPT. Dr. Williams will discuss these architectures, which modify standard self-attention layers and model long-range dependencies without significant reliance on layer depth. After these architectures are introduced, peripheral components of these near-shallow networks, as well as their modified forward operations and learning processes, will be discussed in detail. Following this discussion of architecture and model details, current applications of this research will be presented, which are focused on embedding untrained precision language models (PLMs) on microprocessors in edge computing scenarios, i.e., acting as hardware-based controllers for small electronics devices. Discussion will focus on how these PLM systems have been designed to operate in air-gapped environments over CPU-driven training on microprocessors from scratch, and will go on to detail a fully developed control system of this kind and its user interface. This final subject will present recent positive experimental results in training localized PLMs on Le Potato (https://libre.computer/products/aml-s905x-cc/), whose success was identified upon a U-LaMPS's very first training run, in only 20 minutes of lay user interaction through a microphone and light switch.
    May 9 TBD [Virtual/Zoom]
    Speaker: Ming Ding, Zhipu AI
    May 16 TBD [In-Person]
    Speaker: Edward Hu, Prev. OpenAI
    A new training objective for LLMs.

    Recommended Reading:
    1. Amortizing Intractable Inference in Large Language Models
    May 23 TBD
    Speaker: Loubna Ben Allal, Hugging Face
    Code LLMs (e.g. StarCoder).
    May 30 TBD
    Speaker: TBD