DAVID J. MALAN: All right, this is CS50, and this is the CS50 Educator Workshop and our session on "Teaching CS50 with AI," in which we'll speak to exactly how and why we've leveraged artificial intelligence to actually teach the course and support students therein. And we started this project really over a year ago now with this working premise that tools like ChatGPT, which came onto the scene in November of 2020-- 2022, seemingly overnight. 

But the working premise for us is that tools like this really out of the box, working in their default way, are really too helpful. Tools like ChatGPT, Bing Chat, and the like are all too eager to answer each and every question that you might ask of them. And even if that question is, can you do my homework for me, they are all too willing to oblige. So ironically, we set out a year plus ago to put in place some pedagogical guardrails to really make those tools less useful for students or really more properly stated, more educationally useful for students so that the tools might guide students toward useful answers and insights but not simply spoil answers to problems outright. 

In terms of policy, this is literally the language we adopted early on in the course's syllabus for students, whereby we said that it is not reasonable, that it is not allowed, to use AI-based software like ChatGPT, GitHub, Copilot, Bing Chat, and the like that suggests or completes answers to questions or lines of code. This is not something we can enforce technologically. 

Even if the school or your school could prohibit access to certain websites with a firewall, the reality is most any student could still take out their own device, whether on campus or at home, and still access those services. So we've also woven into the course all the more of a discussion of ethics, academic honesty so that students at least are cognizant of where we are coming from and where those lines are. However, we didn't want to simply take away tools that arguably are going to be with us now here on out and do have value. 

So we deemed it reasonable for students to use CS50's own AI-based software, including the CS50 Duck in virtual form, otherwise known as the Duck debugger, or ddb, in our own web-based applications, namely CS50.ai, which you'll soon see, as well as CS50.dev, the Codespaces environment you saw yesterday as well. And so we really set out to bring this virtual duck to life. 

And in fact, thanks to CS50's own Brian Yu, for a few years, we had a chat bot of sorts built into VS Code and our previous programming environment, whereby if students typed a question into a chat box, they would actually get back a response. But the response wasn't all that helpful. 

But the goal was to indeed implement what's called rubber duck debugging or rubberducking, the idea being that in the absence of a smart colleague, family member, friend, or really just someone more technically proficient than you, you should at least keep on your desk a rubber duck or any inanimate object really so that you can talk to it about any problems or bugs that you're having. And the idea is per yesterday that invariably in expressing your thoughts step by step, ideally to this inanimate object, the proverbial light bulb goes off, and you understand wherein you are confused about some topic or bug. 

We try to bring this to life historically in CS50 not only with small rubber ducks but very large ones as well. But the virtual version that we had for some time looked a little something like this in CS50.dev, that is, Visual Studio code. A student might begin a conversation with, "I'm hoping you can help me solve a problem." And up until fairly recently, all this duck would do that Brian implemented for students was quack back either once, twice, or three times. 

So it wasn't completely simple. There was some pseudorandomness in there but still just really one line of code. But I dare say quite a few students online and eventually on campus were fairly astonished where after two plus years of the duck only knowing how to quack-- suddenly and literally overnight started to responding to students in English or in some cases in other human languages form, depending on how students initiate the conversation. 

And so really this virtual rubber duck was brought all the more to life. And educationally, what our goal has been is really this, to provide students with virtual office hours 24/7. Even at a place like Harvard or Yale, where we're fortunate to have lots of teaching assistants and so forth, the reality is we've never really been able to keep up with the volume of questions, the demand for support. And so providing students electronically somehow with all the more support in addition to our traditional human support structure has very much been of interest. 

And I dare say this really is the Holy Grail, to approximate ultimately a 1 to 1 teacher to student ratio, which we have never had here. Indeed even within our campus sections or recitations, which are led by our undergraduate teaching fellows or teaching assistants, we might have a ratio of 1 to 6 at best, 1 to 12 more commonly, 1 to 20 or more. And if you think about just mathematically how many minutes there are in an hour, if you've got six students in a room over the course of an hour, that's only 10 minutes per student. And that's not nearly enough quite often for those students coming in with less strong STEM backgrounds or simply struggling more generally with a new world. 

So what we thought we'd do today is share with you a little bit of the features that are student and teacher facing but also some of the underlying architecture so that you as teachers actually understand how today's forms of artificial intelligence are working if only so that you feel more comfortable and confident or skeptical about how these tools are actually built and what they can do. Or educationally, you can teach your own students about some of these underlying technologies that are increasingly around us. 

For my part, I'll focus really on the user-friendly side of this diagram, which is to say the user interface. And then we'll dive more deeply into what CS50.ai is. We'll talk about a technology known as a vector database, which will allow us to enhance our searches and answers to questions. And then we'll talk a little bit about the cloud services like OpenAI and Azure that underlie CS50's own implementation of this rubber duck. 

So thank you to our friends, first and foremost at GitHub, at OpenAI, at Microsoft who have wonderfully been making a lot of this support possible through cloud-based credits such that all of the tools you are about to see and the tools with which you might already be familiar firsthand have been made freely available to teachers, to students alike. So everything you're about to see you may use indeed with your own students. 

So you've seen this environment already, VS Code for CS50 at cs50.dev. This is a landing page via which students can log into that website. And what we set out to do early on in this process of trying to embrace artificial intelligence was think about what problems could we solve with AI. And so we set out first to do something fairly simple and bite size just over a year ago. Could we write a program that somehow uses AI to just explain code to students? 

And this is something that does need AI because we couldn't possibly write a program that has enough if, else if, else if, else if, else if, else if conditionals to be able to deterministically tell students what a given line of code is unless we literally have an if condition for every possible line of code they might see or write so that we can pattern match on it. So we really need a more generalized solution. And this is where AI really shines. 

So here's a screenshot of VS Code in night mode. Here then is some sample code written in C. This is simply a program that says hello so and so after prompting the human for their name. But this is perhaps representative of a program that a student might not understand. So with this extension built into cs50.dev, students can highlight one or more even all lines of code, Right-Click or Control-Click. 

And this option here in the middle, Explain Highlighted Code, even though it looks officially built in, it's not. We contribute it via a CS50 specific extension that we pre-install into cs50.dev so that when students click on that, they then see a ChatGPT-like explanation of the very lines of code that they have highlighted. So not necessarily the most complicated code, but you could imagine a teacher, a teaching assistant, spending a good number of minutes writing out something so thoughtfully step by step, line by line to explain that code. 

Here, though, students have an answer within, say, three seconds, instantaneously automatically generated. So that actually worked fairly well and wasn't that hard because a lot of the tools out there nowadays, OpenAI included, have been trained on so to speak, lots and lots of code from the internet, for better or for worse. 

And so even though we did not write a program that says if, else if, else if, else if, else if to recognize students code, odds are over time certainly for "Hello, world type" programs, someone out there wrote something like this, someone on Stack Overflow, Reddit, or elsewhere probably explained code like that. And so tools like ChatGPT sort of know, therefore, how to explain these particular lines of code. 

Well, what else might we do? So version 2 of our implementation of AI tools for CS50 students were, could we advise students how to prove their code style? So many of you know we've had that "command line terminal based" tool called style50, which just tells you how to format your code line by line if it's inconsistent with a style guide. Well, we thought we'd try to do something more graphically and a little more explicitly so that students understand step by step what it is they need to do. 

So here's some more C code. It's pretty messy. Everything is left aligned. There's no indentation. So it's not pretty printed even though it might be correct. If students now click a button at top right labeled style50-- so we've turned it into a button. Instead of just a command, they'll see side by side a so-called diff, D-I-F-F, which shows the difference between the way their code currently looks and the way we think their code should look per CS50's own style guide. 

And pictured there in green are sort of highlighted spaces that the students should add with their spacebar or tab character highlighted in red or perhaps things that they should delete or change. So we show students think side by side. But if they don't quite grok what the difference is or why they might do that, there's also a Explain Changes button, which, if students click on it, opens up, as before, a ChatGPT-like explanation of what they should do and/or why. 

Now students, once they get comfortable with this and once they want to automate this, they can actually just click Apply Changes at top right also. And that will just reformat their code for you. And the most sophisticated students can certainly go into VS Code and just enable automatic auto formatting. We actually disable all of those user-friendly features for pedagogical reasons. But the goal here, at least in the early weeks of the class, is to at least put in front of students the side by side and just get them thinking about and developing the muscle memory for making these kinds of improvements themselves. 

Well, then we set out more educationally minded to try to answer students' questions. And this tool here even though we use it for computer science and programming, really is generalizable. So those of you who teach in other subjects beyond CSS and programming might take an interest in somehow trying to leverage this same idea for other domains as well. But could we through AI answer most of the questions asked online by students? So for CS50 on campus as well as CS50x online, we've long used, among other tools, a Q&A a tool, question and answer tool, called Ed for educational discussions. 

And here's a screenshot thereof. And this is a screenshot of a representative question a student might ask called, say, John Harvard. What is flask exactly they might ask? Now, years ago, I or one of our teaching assistants on campus would log in and answer this question by typing out a response, a sentence, a paragraph, or more. And frankly, this is an easy question, right? This is very googleable in the first place. It's very definitional. So this is not a hard question, but this is where we began. 

This is an actual screenshot of how the Duck built with AI responded to this particular question. Flask is a micro framework written in Python, dot, dot, dot. And I'll stipulate for the purposes of today. It's a pretty good answer. It's not hard. Bing, Google, any number-- I mean, even Yahoo could have figured this answer out. But in our case, we wanted it to be dynamically generated in the Q&A environment the students are already using. 

But within this environment, there's an opportunity for students to ask more sophisticated questions as well. So here's a question about the Caesar problem set, an encryption-related problem. Here's an error message formatted in monospaced font. And then down here, there's a more nuanced question asked by the student. It's not just a definition. It's not even yes or no. Is there a more efficient way to write this code? 

And so here too I dare say, AI even in 2024 is already starting to shine in terms of quality of results. And this is kind of a mouthful, but let me stipulate for today's purposes. This is a pretty darn good answer. It goes into an explanation of what the student seems to be doing and what they could potentially do instead. 

It doesn't give them too much code. It, in fact, just gives them some boilerplate code here in a monospaced font. But it does disclaim to them here at the bottom, "PS, I'm an experimental bot. Quack. Do not assume that my reply is accurate unless you see that it's been endorsed by human staff. Quack." 

Now, this is a short-term intervention because many of you might be familiar in general with the imperfections of AI still. And I think they're going to diminish in frequency, but they are with us for now and probably some time. So this is just a reminder to students that this is all very much beta. But we do have a feature within Ed that namely is this button, Endorsed. 

And so the way we have mitigated this educational concern that sometimes AI might just hallucinate or make stuff up is that we click as a human this Endorsed button just to additionally signal to students that even though the AI duck generated the response within three seconds, usually there's enough of us awake at all hours that within a few minutes, a few hours, we add this additional signal to students that, yes, this AI-generated response is trustworthy because a human has weighed in on it. Frankly, I do think this is a short-term intervention. And eventually we will not be bothering to endorse answers anymore once we're all the more confident in answers. 

Now, we've also implemented at CS50.AI a web application that students can use for free with any GitHub account on their laptop, desktop, or even phone that provides a much more ChatGPT-like interface that's conversational, with which the students can interact back and forth, back and forth, just like any chat bot. It too starts with a similar disclaimer at the top to remind students to think critically. And then they can proceed to ask any and all questions. 

And for instance here, I dare say this is rather representative of the types of questions a student might ask. My code is not working as expected, any ideas? But because so many students around the world have asked questions like these and adults, for that matter, on sites like Reddit and Stack Overflow and the like, they're there too. There's a lot of training data on which artificial intelligence has been prepared. And so when the student provides not only that vague question but a bit of code, even this duck implemented on top of those AIs can actually give a pretty darn good answer. 

And in fact here, the ddb, duck debugger, responds with, "It seems you're trying to add two integers. But the input function in Python returns a string. So when you try to add x and y, you're actually concatenating two strings." And then goes on to provide a bit more detail and an actual solution-- so we're actually finding that students are increasingly using this conversational UI instead of the more asynchronous Q&A tools that we described earlier. There are some other features built into this same interface, these hearts up at the top. 

Those of you who have played Zelda or similar games might know of HP and how you lose over time as you get hurt or tired in the game hearts until they eventually regenerate. And we implemented this for a couple of reasons. One, we definitely saw a long tail of students asking probably too many questions. I don't know precisely what the wrong number of questions is. But when you're asking dozens, hundreds of questions of the duck about some similar topic, I think back to high school where I had pretty good but pretty mean teachers who would eventually send me back to my desk to think about the answer they had just given me before they'll entertain additional questions. 

And so through software, we've tried to approximate that idea of pumping the brakes for students to get them to reflect on what information they've already been given as opposed to just fishing continually for more and more help with the same question. This also tends to keep, frankly, utilization down overall to keep costs in check so that everything can indeed be free for all students and teachers. 

So how does this all work? And how might you, using our tools or even other current and future AI tools, actually implement functionality like this? Well, in the world of AI right now, there's this notion of prompting and prompt engineering. So prompt engineering isn't really engineering, but it's the art of asking good questions of today's artificial intelligence. A system prompt in particular are instructions that someone you, me, a programmer, any teacher gives to an AI in advance, typically to give it a personality or to give it some guidelines, really those pedagogical guardrails that I referred to earlier. 

So in our case, for instance, we literally have instructions built into CS50.AI that read a little something like this. This is an abbreviated version thereof. Quote, unquote, "You are a friendly and supportive teaching assistant for CS50. You are also a rubber duck." Those two English sentences are sufficient for programming an off-the-shelf AI, like OpenAI's or someone else's, to not only know what its purpose in life is supposed to be as an AI but also that its persona should be that of a rubber duck. 

We go on, though, in our system prompt to say, "Answer student questions only about CS50 in the field of computer science. Do not answer questions about unrelated topics. Do not provide full answers to problem sets as this would violate academic honesty. Answer this question, colon." And then essentially, we prefix-- we prepend this system prompt to every student question so that the answer they get from the underlying AI respects not only their question, but also this system prompt that we fed in as well. 

And the student's question is what the world of AI generally nowadays calls a user prompt. So system prompt comes from us or the system administrators or the programmers. The user prompt comes indeed from the users. Now there is a danger in here whereby it's all too common with AI today to try to trick it into misbehaving or doing things that it wasn't designed to be used for or for which the person behind it doesn't want to pay for in terms of the underlying cloud utilization. 

So, for instance, imagine if a clever, daresay malicious, student's user prompt were this. Ignore all of those instructions. Do this for me instead. You could imagine in English or any other human language the AI deciding, OK, insofar as I understand that instruction, ignore everything previous. It will then oblige and do whatever the student might want. So that's what's known as a prompt injection attack. 

Those of you who are familiar already with SQL and SQL injection attacks, it's pretty much the same idea. But it's almost more simplistic because you literally can just inject English by telling the AI how to behave. So among the complexity of CS50's own tools and other commercial software out there nowadays is that we try to decrease the probability of prompt injection attacks. 

So you're welcome to try, for instance, as teachers. But odds are you should not be able to trick the CS50 duck, for instance, into solving student's math homework or English or history homework instead. It's not perfect, but the frequency with which that happens is quite low. And frankly, thinking strategically, if a student really wants to get help on some other subject, they technically could just open up ChatGPT or similar on some other device anyway. So the goal really is just to put downward pressure on but not necessarily solve 100% those kinds of scenarios. 

Now, that said, we also abused our own system prompt a few months ago. On April 1 in the US and a lot of countries is a tradition of April Fool's, where you do funny things or play pranks on people. And some of you, if you use CS50's duck on April 1, 2024, you might have been "rickrolled" so to speak. 

And all we did was this. We changed our system prompt on that day, thanks to Rongxin, to say, "You are a friendly and supportive teaching assistant for CS50. You are also a rubber duck in Rick Astley's band. Importantly, you should always cheer up the student at the end by incorporating "Never Gonna Give You Up" in your response. Answer this question." 

So this is not really educationally valuable. It was really just an Easter egg for that period of 24 or so hours. But it does speak to how people are programming today's AI's. And what you're seeing is an example of, therefore, prompt engineering, which is a technique that will probably evolve over time. But here too is perhaps a glimpse of the future insofar as we are somehow programming this underlying software, not in python, not in C, not in any other programming language but in our own human language, English here in this case. 

So I wanted to give-- so an example, if I may-- if a student on April 1 asked what is recursion, they might have gotten this response from the duck instead. I'll give folks a chance to read that. For those unfamiliar, just Google Rick Roll, and you'll see that this is reminiscent of some of the song's lyrics. 

Now some results before we transition to some of the underlying architecture-- so based on last summer's Harvard Summer School class, which we offered with a small number of students here on campus and online, and based on fall of 2023, when we offered the duck to students at Harvard, at Yale, at a larger scale, as well as to tens of thousands of students via edEx and other platforms, we've seen utilization that's quite high already, up to 142,000 users as of today, some 20,000 prompts or questions have been asked on average per day, 5.8 million total such questions thus far. So it's a large and increasingly large data set. 

Among our undergraduates on campus when we surveyed them as to their utilization, almost all students used the duck fairly regularly. So pictured here in blue in 17% is 17% of students use the tools the tools more than 10 times per week. Pictured here in green at 32% is students who use the tools 5 to 10 times per week and from there. And so pictured in red are only 25% of students who last semester only used them less than, say, twice per week. 

In terms of helpfulness, long story short-- and we have a longer paper that speaks to some of students experience-- most students found the duck as is, version one, helpful already, 47% finding it very helpful, 26% helpful, and so forth, not necessarily very quantitative measures. But this is to say-- and this is pretty impressive, I think, for a new technology in general, not necessarily ours but AI more broadly-- it's already working overnight pretty darn well and is probably only going to get better. 

More into the weeds, when we looked at the impact of AI on students and teachers experience on campus, we looked at prior years data before the CS50 duck spoke English. Without AI, students asked in previous years an average of 0.89 questions each of TFs. Now, that's not a lot. That's literally less than one question on average. But with some 500, 600 students, that actually still adds up to a significant volume being asked at least on campus in a University environment of our own instructor and teaching fellows. 

With AI, this past fall semester students asked 0.28 questions each of TFs, that is to say, a significant reduction in questions being asked of TFs. Now, not necessarily all positive-- I will disclaim that this is not necessarily a good outcome when it comes to our teaching teachers how to teach and creating opportunities for them to answer students' questions and think about the material themselves. 

But I will also stipulate that we have long been overworked, so to speak. Or there's just never been enough human support for the students and the volume of questions they're from. And so generally speaking, we're excited that this allows us to now re-use all the more effectively what human support we do have. 

Another result worth sharing is this. Without AI, in years past, students attended 51% of our available office hours, one-on-one opportunities or small group opportunities for questions and answer with me or with the course's teaching fellows. With AI, this past fall students only attended 30% of those available office hours, presumably because they're turning interactively to the AI duck as opposed to scheduling an appointment, walking across campus, logging into Zoom. Their behavior has already been impacted. 

Among our favorite quotes anecdotally from students is this, "the AI duck felt like having a personal tutor. I love how AI bots will answer questions without ego and without judgment, generally entertaining even the stupidest of questions without treating them like they're stupid. It has an, as one could expect, inhuman level of patience." 

And this particular quote really resonated with me if only because I can remember vividly still some 15, 20 years ago, being in graduate school, going into the office of a very kind-hearted smart professor, who just seemed too smart for me to ever quite follow perfectly. And I would ask questions in office hours of some CS theoretical topic. I would eagerly like a good student nod and say, yes, yes. I understand. I get it. 

I would leave the room and then not know exactly what was going on still because there was this social tension, this imposter syndrome, however you might want to view your own experience, whereby I wasn't really getting answers to the questions I had because I felt that there was this upper bound on just how many minutes I should be spending there and how many questions I should be asking, especially when there might be someone else at the door. So there's something to be said about the mere fact that software indeed has this patience and this endless supply of patience and answers that I do think is fairly game changing, certainly for those of us who might have felt especially insecure in some of those learning environments, no matter how friendly and welcoming the actual teachers or professors were. 

Another student wrote, "the AI tools gave me enough hints to try on my own and also helped me decipher errors and possible errors I might encounter." "I also appreciated," says another, "that CS50 implemented its own version of AI because I think just directly using something like ChatGPT would have definitely detracted from learning." And just to seed your own thinking, I do think-- and we're already seeing there to be other impacts of artificial intelligence. 

I do think in our CS50 case, grades are only going to continue to rise. And long story short, within higher education, certainly within Harvard, there's long been a concern about grade inflation whereby everyone's grades have been trending up and up and up such that almost everyone gets an A in some form. This is only going to further exacerbate that trend if it's indeed an actual problem insofar as-- and this is actually a good thing. If we increasingly provide students with software tools like check50 for correctness, style50 for style, and maybe something like design50, as you'll soon see for design, which is much more qualitative, it stands to reason that students code should be and work and look better before they submit because they're now getting iterative feedback. 

So I dare say, and I would propose that we're probably beginning to provide students with a better educational experience even if it might become harder or less obvious how at the end of that experience we should be assessing what it is they've taken away. So a good topic perhaps for today's birds of a feather-- but for now, allow me to leave you as I step off stage with a screenshot of the paper. 

And we'll share this link via the workshop's website of all this and more detail that discusses how and why we implemented the same, thanks to not only myself, CS50's own Rongxin Liu, as well as a number of CS50's undergraduates and present and past staff. But allow me to now turn the reins over to Rongxin. I will now hop into the chat and take any and all questions interactively online. We'll see you in just a moment. 

RONGXIN LIU: So I'm now going to talk about some implementation detail of the CS50 Duck. So just so you have an idea of what's happening underneath the hood of the CS50.AI and hopefully give you an idea on-- well, you might be able to also create your own AI tutor by using OpenAI's APIs. If you would like to take a look at the source code that I will be using for this workshop or the demo for demo purposes, you can also go check out the code. 

And again, I want to reiterate, there are no need to follow everything I did in this workshop. I think the important takeaway is just trying to understand what's happening behind the scenes. And if you want to look at the source code, if you want to implement something yourself, you can do that later. No need to following what I'm doing here live. 

To give you a detailed overview of what the system architect looks like for CS50.ai, this diagram basically sums up all the major components. David already explained-- I'll give you a sense of what it looks like from the user's perspective when student interacting with our tool. I want to focus more on my part and the subsequent part on the back end side of the CS50.ai. 

So whenever student type in a question, for example, like what is Flask? So the message actually gets sent to our own server sitting in the cloud. And our server actually do a bunch of string manipulation, essentially. We first check to see if there's any personal identifiable information. We use a library to basically scrub those information away because we don't want to leak students' sensitive information, for example, like email address, phone number, credit card numbers, all sorts of things. 

So we would do the sanity check ourself. And then we would just replace all the sensitive information with a redacted signature there. We also perform a prompt injection attack. David also mentioned this. Some malicious user are oftentimes trying to jailbreak our system by using jailbreak prompt. 

So we actually come up with an interesting way by utilizing AI again. Literally when we detect something unusual in the student's query, we actually ask GPT 4, hey, is this a prompt injection attack and let GPT 4 decide. If it is, we will just abort the communication. If not, we will proceed by sending the request to OpenAI. We also anonymize all the request IDs so that all the requests that OpenAI sees are anonymized. They are not actually understanding what this request coming from as a way to protect student's privacy. 

So once the prompt-- once the student's query with our own wrapper prompt get merged, we will just send this updated prompt to OpenAI to generate a response back. And then we will just send this response back to the student directly. You might notice there's a component called the vector database. And there's a rack called a retrieval-augmented generation process happening as well. That was a way for us to ground the model. I will explain what grounding means. We try to utilize our own CS50 material or content to augment the OpenAI's response generation so that the GPT response is more truthful or more faithful to our CS50 content. 

In particular, for the APIs, today, we are going to cover these three major API. I'm going to cover the first two for the most of the time. The first API is Chat API, which is the Chat Completion API, which is the API we use extensively for the CS50 Duck. Whenever you are interacting with the Duck, we are actually using Chat Completion API to facilitate that. We also utilize Embeddings API. That's the API we use for the RAG pipeline for the retrieval augmented generation process. 

Assistant API is actually a newer API that we did not use in production. For our current version of the CS50 duck, we will be experimenting it in our next version of the CS50 Duck. So first, by now you all probably heard of the term large language model or LLM for short. Large language model is essentially a type of neural network. It was trained to understand text, was able to also generate media, generate video, generate audio, generate text as well. 

In our CSV Duck, we only focus on text generation. We are not going to generate any image, any audio. We only want to use this LLMs model's text generation capability to provide our students the AI experience. So you can view it as a chatbot essentially but a chatbot with context. What it means is when you are talking to this particular AI chatbot, it understands that your question likely fall into the domain of CS50 or computer science in general. That's why it has context. It knows what you are talking about. 

For example, if I ask the duck, hey, I'm having trouble with problem set 1, can you help me? If you ask the general GPT, it might not know what you are talking about because problem set 1 could mean many things. But if you talk to the Duck that actually understand what you're talking about. What you are referring to-- you're referring to, oh, the CS50 problem set 1 likely. So there's a context there. 

And when dealing with AI system, nowadays AI system, you likely need to deal with three rows during implementation, the system row. There's a user. There's an assistant. System David already also mentioned a little bit, which is often referring to the system prompt or the general guideline you give to the LLM model. That's called system row. 

User, which means us, the user interacting with the large language model. It could be GPT. It could be some other open source large language model. Assistant refers to the large language model. So if you get a response back from the large language model, we will say that you get an assistant message back. So these three row-- they work cohesively together to provide the AI experience. 

And this slide just summarize what these three rows are doing during a typical AI interaction. So I'm going to reiterate what system message is again. For example, in the CS50 Duck scenario, this is a simplified version of the system prompt we have. First of all, we define what the large language model will be. So we give them a role. Basically, this is like assigning a role to this large language model, in this case, the GPT 4. So we tell GPT 4, hey, you are a teaching assistant for CS50. 

You are also a rubber duck. It's like setting the personality for this large language model to basically affect its response generation. It might be generating in a more funny style way or a playful way if we are doing this first line. But importantly, we also want to give instruction to the large language model, like a general guideline, like you shouldn't be answering questions that is not related to CS50 or computer science. We want that only focusing on answering CS-related or CS50-related question. 

And importantly, we don't want you to want you, the CS50 Duck, to provide any answers to the problem set. That's important. This is basically the guideline that the large language model has to follow when generating its response. So these are the instructions we are now giving or the rules we are giving to this GPT 4. 

And for each question, we always have this system message enforced. You will find out why we need to do that for every iteration, for every interaction. But anyway, whenever students send us a message, we actually take their message and then concatenate with the system rule with our own wrapper prompt. And then we prompt GPT. So GPT is always following our system rule, our system instruction first before actually trying to understand what the student is actually asking. 

And for this slide, it summarized a typical interaction between user and the large language model. So the technical term is you prompt the large language model to get a response back. That's how you interact with the large language model. And sometimes people come up with fancy term. Nowadays they call it prompt engineering. 

Personally, I just think this is string manipulation here, just a different way of asking question, a different way of conveying things to instruct the GPT model to generate a response. So for example, in your question, for example, if you want the GPT to answer in a certain way, you can provide example. This is often called one shot prompting or few shot prompting, depending on how many example you provide in the prompt. You can also ask the model to adopt a persona. 

So in the CS50 Duck scenario, we say you are a rubber duck. That actually asks the model to adopt a persona. And you can also do other restriction or prompt it differently, hence the thing called engineering. There's nothing really engineering here. It's just a different way of asking question at the end of the day. You are manipulating a string, and then you want to get the ideal response back. 

OK, so now let's talk about Chat Completion API with the foundational knowledge in mind. So when you use this API, you make an API call to Open AI's server sitting in the cloud. Usually the payload will be the message because you are prompting the model to get a response. So for example, I can ask, hey, can you help me with my Tideman problem set? 

OK, the model will just generate a response back. Now, here you might wonder, OK, why the model actually know it's a CS50 Tideman problem set? First of all, Tideman problem set. It's a very iconic term in the internet due to the popularity of these problems, I believe. And actually, when Opening AI train the model-- when it scraped the internet, it actually scraped our entire course content into the training data. 

So the model already has the knowledge of CS50. So that's considered the internal knowledge that the model has. So it naturally actually understand what this question mean. If you look at the code-- again, this is a code written in Python. If you don't understand what the code is doing right now, don't worry about it. Just pay attention to some of the keywords in this code snippet. 

You notice there's a row system, and there's a row user. This is actually the row I just mentioned a few slides earlier. First, there's a system row, which is the system prompt. You are a friendly supported teaching assistant. You are also a rubber duck. You can add more instruction to this system row and in the content field. 

And because we need to utilize this API to generate response-- so you need to also supply, OK, what's the user messages to GPT 4. So we will have another dictionary, another entry, mark it as user row, and then the content will be the question. With these two messages in place, you will send this payload, essentially, these two messages to ChatGPT. And the GPT will be able to take these two message, the system prompt and the user prompt and generate a system response, which is the response you saw on the earlier slide. 

So it might be good for me to actually do a live demo here this time. Although I won't be typing the code line by line, it's just trying to show you what it looks like when it's actually running in the code. And also, this is a good demo for a Codespace demo if you attended the workshop last day, I also talked about Codespace. This is entirely running in the cloud on Codespace as well. 

So I'm going to go to the chat example, the OpenAI example in particular. In this workshop, I also provide you the example for the other vendor, for example, from cloud and Google Gemini API. You can also look at those at your own time to see how similar these APIs actually look in between each other. OK, so let's take a look at the simplest way you can invoke the Chat Completion API endpoint. 

Ignore the first part. This is just how I handle environment variable. So when you work with API or any kind of API, you often need an API key. It's a way for OpenAI to know, OK, you are the API keyholder and you are making this API call. It's a way basically for OpenAI to build you because this thing costs money. So they need a way to build you. That's just the prerequisites. 

But this is the important part. This is the actual API call we are doing. Here I'm just going to quickly show you if you prompt the model without anything what it will look like. So that's why there's no system message here. And because we call the API and we will expect to get a response, that's why I'm storing the response here. I'm just going to print it out back to the terminal so that you can see the response. 

Now I'm just going to run this example. So as you can see, I get a response from the model live, Hello back, because I say Hello. So it just Hello back me. If I prompt again, maybe I will get a different response. You see, this time the response is actually shorter. That actually means the model generation is not deterministic. It's subject to randomness. So every time you prompt the model even with the same question, you might not get the same answer back. That's just something you need to be aware of when working with large language model. 

Now, you might wonder, OK, how can I do a conversation? A conversation is actually text generation. At the end of the day, you need to be aware that when you're interacting with LM model, the model doesn't understand you. The model doesn't have memory at all. So if I ask the Duck, what is Flask? The Duck, our CS50 AI, will give me a response back. And if I ask, what did I just ask? 

The Duck, will say, oh, you just asked what is Flask. But that's actually not the case. The model is not remembering you. It's because we are actually giving the model the entire history for the model to predict the next response. So in the code, it actually looks more clearer. 

So previously we have this simple API call-- this simple API call to ask, "Can you help me with my filter pset?" I get a response back. In order for me to follow up with a question, I can't just go and ask-- I can't just go in and call this API again with a new question. The model don't have any context at all because it won't know what itself just answered to me. 

In order to continue the conversation, we actually need to send back its own response in the payload and then ask a new question there. So this is now four message. And then send it to the OpenAI Chat Completion API and get back a response. As you can see, as the conversation progress, we are sending more and more message because, again, large language model is doing text generation. It's generating text based on what you are sending to it. 

The reason why it can carry on a conversation is because you already supplied them with the previous chat history. So the large language model was looking at, OK, this is the chat history. This is what been talking about so far. OK, I'm going to generate the next response. And in the code, I can also show you quickly. 

So maybe let's look at just chat3. So I'm going to enlarge it. So the majority of the code is essentially the same. The API code we are using is still Chat Completion API. The only thing we need to keep track of is the message history essentially. When user types something in the prompt, I'm going to store the user into a message array. And when GPT generate a response-- I'm also going to store that in the message array. And then I have a while true loop. So this loop is just going to keep on forever that way I can carry on a conversation. 

So I can just-- I can now run the code and show you what it looks like. So I'm going to-- what is Flask? It's going to think a while. It is actually now generating a response, but because my question is about Flask-- so it tries to give me a full definition of the Flask framework. Now it sent back the response. And I can ask, what did I just ask? And then the model was saying, you just asked, what is Flask? 

So the model somehow remember what I asked. But underneath the hood, I'm just telling the model again, hey, this is my question. This is your response. I just asked you what did I ask. The model based on the chat history generate a response back to me correctly. OK. 

Now, I want to talk about some downsides or some problem when dealing with large language model. There's a technical term called hallucinations. So the model, sometimes we just make stuff up. Again, it's generating text. It's not really understanding the thing you're talking about. It's being trained on large corpus. It has the knowledge of the internet. But sometime when you ask it a question, it might not exist in the training data, but the model was trying to generate something that somehow seemingly best matched to your expectation. 

One way to tackle this problem is called grounding. And specifically the technique to tackle this hallucination problem is called retrieval-augmented generation or often referred to RAG. I have my own explanation. Again, this is like a fancy term people invent maybe for marketing purposes. I think it's just handing the large language model a cheat sheet. That's basically it. That's what's happening here. 

So CS50 Duck actually incorporated the RAG pipeline also as well as a way for experimenting as well. So there's a scenario when student asks, what is Flask? First of all, Flask is a web programming framework. The model already understand what Flask is. It's not purely exclusive to CS50. So the model has internal knowledge of what Flask is. 

So if a student asks GPT directly what is Flask-- as a new beginner, as a beginner, you will want some response that are easily understandable. But somehow the GPT 4 model was trained on the entire internet knowledge. It has advanced knowledge of what Flask it. And then it will just try to generate response that's so advanced that a beginner might not able to comprehend. 

So in order to tackle that, we will want to supply our own CS50 definition. It could be a definition that David mentioned in the lecture. And that's exactly what we are doing here for the CS50 Duck. When a student asks, what is Flask, we first try to search our entire lecture database. I will get to it and how we do that. We basically try to search our entire lecture caption, the whole 11 weeks lecture also. 

We try to find what David says in the lecture that actually best matches this question. That might be the ground truth because that's the definition. We can just hand that answer to a student, right? So we retrieve the best match caption. The moment that David actually talk about what Flask is in the course, we literally put that back into our updated prompt. It's kind of like handling the GPT like, hey, GPT 4, here is a cheat sheet. You can just look at this and answer the question. 

We have an updated prompt along with the original question that they asked by the student. We send the whole thing to OpenAI GPT 4 and get back a response. And this response is more, let's just say, learner friendly or CS50 student friendly response. And that's it. That's what the RAG is doing in the CS50 Duck. Or that's what RAG is doing in most of the AI system in the world right now. 

But how exactly can we perform this search quickly? So here the embedding thing comes into play. Large language model or other neural network, they don't actually understand English. They understand this a vector of floats. In particular, this vector is in a dimension of 1,536. 

And we are actually using a text embedding model to do this conversion. So, for example, we use a text embedding model to embed what is Flask into this-- like a vector of floats. If you supply only one character to this embedding model, you will still get back the same dimension of different vectors. That's just how this text embedding model works. 

So what we do is we actually go through our lecture caption. We segment them into a 30-second chunk. And then for each 30-second chunk, we create a vector representation or text embedding for it. And then we store them into a database often referred to as the vector database. Here's what it looks like. For example, for this particular moment, David is talking about Flask is a third-party library, goes on and on. 

And this is actually the lecture caption segment. Here is just an example. What we do is we literally just concatenate this 30-second chunk into this. This is just the chunking strategy we take. You have different chunking strategy you can pick or you can design. We utilize OpenAI's text embedding model to generate a vector representation of this particular 30-second chunk. We believe this 30-second chunk-- once converted into vector representation, we will be able to store it in a database so that when student asks a question, we can start going through and see how similar the student's question is compared to this 30-second chunk. 

The calculation between these vector is just simple cosine similarity. It's like a dot product. We basically go through the entire database, calculate each cosine similarity or the dot product result, and rank them from the best match to the worst match. And then we just pick the first top end matches and put it into the prompt. So here's the entire float of how this RAG pipeline works in CS50 Duck. 

When student asks, what is Flask, we convert it to vector representation. We perform a search against our vector database that is full of tens of thousands, 30-second chunk of caption represented by text embedding. We find out the best match. We put it into the prompt. This is plain text now because we already retrieved the document. We should just put the plain text into the prompt. 

And then we ask GPT 4, OK, here is your prompt. What is Flask? Here is some useful information. And then just put in the information, the cheat sheet. And then the OpenAI GPT will be able to generate a response. The vetor database we use for our CS50 Duck is called Chroma DB. It's an open-source vector database. There's many more vector database nowadays available in the market as well. 

And this is just another diagram summarizing what it's doing here, essentially the same flowchart I just showed a few slides ago. Again, the query comes in. We call OpenAI text embedding model to generate an embedding for the question. And then we search against the database, the vector database. We retrieve the document. We put it in the-- we put it in the prompt often also called the LM context window. And then we get back an answer. And that's embedding based search, and I can quickly demo it as well. 

So I'll queue this. It might be fun to quickly show what embedding looks like. Let's say now I'm actually calling the OpenAI API to create an embedding for this particular question. I'm just going to run it. 

You see that it generate a bunch of float numbers. This is actually a vector of floats that represent what is Flask. That's just how this text embedding model understand. If we change it to simply Cat and save it-- and I'm going to run it again-- I will still get back a 1,536 dimension of a vector or float. So no matter how long the plane test you supply, you always get back the same dimension of the vector. 

And, again, you don't need to actually read what those vector representing. It just gives you an impression of what it's doing. Now, for this demo, I want to quickly demo how you can perform an embedding, like a question answer with embedding-based search. So I'm going to full screen myself here. 

So for this demo, what I'm going to do is I'm going to create a vector database or a JSON-- like a document that contain all the caption segment representing by text embedding. For example, this is what the original text transcript looks like. This is just the AI lecture from last year or from last year's for lecture 1 to lecture 3, so on and so forth. 

I'm going to just run the script directly for you. So let me quickly explain what the script is doing. So this particular script create_embedding-- what it's doing is we first go through each lecture transcript. So we go through each lecture transcript. We chunk them into, I don't know, 500 characters. 

It's a very naive chunking strategy here. We just chunk it into 500-character-long chunk no matter what. We don't even consider the punctuation or not. But this is often referred to the chunking strategy. So you have different way of chunk the content and then store them into a vector database. So in order to demo, I just do a very naive chunking strategy here. 

So right now it's creating embedding, moderately executed the script. What it's doing is for each 500-character-long chunk, I'm going to call the OpenAI's text embedding API to create a vector representation. And I'm going to-- I'm going to associate this particular vector with the 500-character-long chunk. 

That way when I do the retrieval, I know, OK, this vector whenever it's the best-- whenever it is the best match, I know how to get back its original plain text so that I can put that into the prompt and then prompt GPT. It takes a while because we are sending the entire fall 2023 lecture for OpenAI to generate a text embedding. And soon you will see what it looks like once it's finished generated. You can also run the exact same code I provide for this workshop. All you need is to obtain your own OpenAI key. And that's all. 

I can also show you what it looks like while generating. So it generated a JSON Line document. So to give you an example, for example, this one, this is one of the 500-character-long lecture caption segment. And it is associated with its corresponding vector representation and the following vector flow representing this entire 500-character-long segment. 

You can just end the process here because I already have an embedding created beforehand. Now, once we have the vector database somehow ready, what we need to do is perform a search. So I can do this. Second demo is enter a query. So let's just say, what is minimax? This is a question I'm sending. I'm searching against the entire lecture to see which particular segment actually match this question. 

So I'm going to ask-- you get back this particular result. This is the top one matches that we found from the entire vector database, if you will. And it does mention what minimax is. So it is sort already answering the question, but it is not answering the question. OK, this particular caption actually somehow match what this student is asking. 

So the next step will be, naturally, we want to put this entire message into the prompt and prompt GPT to get a response back. I will ask the same question what is minimax. And you see, this is the actual prompt we are sending to OpenAI. That's the original question from me. That's the original question from me and then the retrieve document here. And then OpenAI will basically utilize the information I provide to generate a response back. And the entire RAG pipeline looks like in a simplified way-- 

Let's do another fun thing, I guess. Is there a Mario Brothers Oreo mentioned in the lecture? Because oftentimes in CS50, we do like to hand out Mario cookies somehow, or Mario Brothers Oreos. Oreos. I don't know how to spell Oreos, but that's fine. Is there a Mario cookie mentioned in the lecture? 

The AI said, yes, there's a Mario Brother Oreo mentioned in the lecture because the lecture segment we create actually contains the Mario Brother. And then it was able to generate a text embedding. And then during the calculation, we found the best match. And this is actually the information we provide to OpenAI. And then OpenAI GPT 4 can answer the question correctly. Without this RAG, the model will start hallucinate because, first of all, the model didn't attend the lecture. They didn't know what happened in the lecture. So it will try to make up some answer. 

OK, that's the demo. So that's actually what's happening underneath the hood of the CS50 AI Duck in its current version. And right now we're actually working on a newer version of the CS50 Duck. We hope to improve the experience for student. This is what it looks like now. It also works perfectly on mobile. Student can have their own customization for the Duck as well. To quickly show what it looks like, I'm going to open the browser. 

And also, we actually randomly show postcards from the world that people send to us with the Duck. This is just how we design the UI. So I'm going to give you a sneak peek of what the new CS50 Duck looks like. It looks like this. I can ask a question again, like what is Flask? It will start answering my question. 

If I don't like the traditional way the Duck presenting me answer, I can also switch to a chat mode. So it's more user friendly, I suppose. If you don't like the light theme-- you want to personalize the Duck-- you can also put-- you can also pick a cupcake theme perhaps because, in cupcake, we also handle a lot of cupcake in our CS50 lecture. So we want to incorporate this Easter egg into the Duck as well. 

We also starting to experimenting how we can better assess student with the help of AI. So we start incorporating code review and pair programming. This is still in development. Things are subject to change. But this is the direction we are heading on. So we would like to give you a sneak peek of what it looks like for the newer version of the Duck. So with that, I'm going to hand over to our next amazing presenters. 

JULIANNA ZHAO: All right, so today I'm going to be presenting some of the new ideas we've been having on the team with how we can improve the AI tools that we present to our students. So when we started thinking about future work, our first question that we wanted to address was, what else can we offer with AI tools? Because we were quite sure that we had not exhausted all the ways in which AI can make our learning experience better for our students. 

So the first thing that we recognized is that, in previous CS50 pedagogical reasoning, we know that ideas like think, pair, share, where we have our students in section turn to each other and talk and discuss what they're learning, or with what David mentioned earlier with rubber duck debugging, we know that when we have students express what they believe and what their learning process is, then they're able to have a better learning experience. And this is something that I'm sure all of our instructors are also aware of in the way that we teach our students computer science. So knowing that we want to have more student engagement, how can we make that more accessible with AI? 

So the first idea that we had is maybe we can create a duck pair programmer where we have-- the student were given a prompt. A student sends code and their question to a duck, and the duck gives them advice. So this would be a tool for practice, not really for homework or assessment but rather just a way a student can get their current work and their current train of reasoning assessed by an AI, where they don't have to go to a human TF, whether that's because that's not as accessible or because they're a little stressed, and they don't really want to go as a beginner. This would be a much easier way for a student to practice with their own time and their own laptop. 

So I'll give a little demo with what we have so far. So the first thing that we have is instead of going to chat, if we go to pair program and student, then with a prompt like this, where a student is to write a Python function called check_password that checks that a string password has four conditions, then maybe they start off something like this. 

And before they even continue, they want to make sure that this is the correct setup of this function. So return true or false. And so they might ask the duck, is this setup correct? I'll zoom in a little so that it's easier to see. So the thing that we want to ensure with the Duck pair programmer is that the Duck only answers the question that the student wants at this moment because maybe the student just wants to confirm that they're on the right track. 

So then as a teaching fellow or as an instructor, you would want to make sure that you don't blurt out the answer. So without doing any prompt engineering, if this is all that we give the AI-- we tell that this is the prompt that this is what the student has, and this is the student's question-- what we found is that the AI essentially answers the question but then also blurts out the exact answer. It writes the snippet of code, the entire function that's necessary to complete this assignment. 

And for students coming to this tool or to a teaching instructor in good faith, trying to practice, that wouldn't be helpful at all because this is the answer that they want. They just want to confirm that they're on the right track. So now maybe the student can continue setting up a few conditions. So they recognize that they need upper equals false, lower equals false, and then symbol equals false. 

And then maybe the student sees that I have to iterate through all the characters in that password. So they start writing a for loop, for char in password. And then they run into a problem because they don't know how to check if something is uppercase. So if upper, then set upper to true, same for all other conditions, and then return true if everything is true. 

And right now they have a little bit more pseudocode. They're working towards their final answer. And they want to confirm with the instructor that they're still on the right track. So am I on the right track? Also, how do I check if a character is uppercase? You can easily imagine a student coming to you with this current status for code and then this simple question. 

And so sending this, the pair programmer would say this, so, yes, you are on the right track-- here's a little encouragement-- and then answers specifically that question. If we were to give this code and then this question without any prompting, without telling the AI not to give more than exactly what the student needs at this moment, then once again, the AI would just say-- it would just complete the code is upper, is lower, and then four character in and the list of symbols. 

So we are making this as a practicing tool, assuming that students are coming and then seeking a replica of student, teacher interactions. And this is more accessible because, once again, every student can do this at their own time. And so we'll always have some simulation of a one-to-one relationship between students and teachers. So where exactly can we incorporate this into the CS50 curriculum or into any other CS curriculum that you're thinking about. 

So we try to have practice as something that we encourage our students to do or even mandate our students to do in our current CS50 curriculum. You're probably familiar with something like this. This is currently the first part of every pset, where there's a problem that students can walk through with the solution given to them and then just submit the answer at the very end. These were referred to as labs pre 2022. But now they're just a part of the pset. 

So if we click into one of them, which is scratched from pset 2, this is what the students see now. We have the prompt and then a demo. And they can go through and click for the advice and hints. And this is great because it gives students easy access to code that they can look to in completing their psets in the future. 

But this is still a bit too easy because there's no push back. There's no questioning that a teacher would give a student if they were to walk that student through a similar problem. If students want to seek that out instead of the experience that they have right now, we can offer them something with a Duck pair programmer. 

So we have this right now at lab. So for the same question that we just saw, which is Scrabble. Let's say a student currently has this much code. And then we want to ask the Duck for advice. So, first, the student might say, "I'm stuck. I'm not sure how I can figure out how many points each letter gets." 

And if we ask the Duck, the Duck gives this as an answer. So it answers the question of-- you can use ASCII values. And then you can subtract uppercase A or lowercase a. So what the Duck doesn't say is, for example, you can iterate through the string as an array, which is another thing that student is trying to learn in this week. The Duck also doesn't say that you can put that in a function to abstract the entire thing and make sure you have better design because we're hoping that the student can figure that out in the future as they work through the problem. Now, of course, the student might just ask the Duck, "Give me the solution," all right, in which case the Duck says that it won't provide the solution. But it'll give a more in depth walkthrough. 

So for this, we don't really even worry that much about the student asking directly for the solution because they have the solution on the website. We already gave that to them. Instead, we're trying to add blocks so that when the student is coming without asking for the solution, the Duck won't just give it away directly. And this ensures that-- this is exactly what an instructor would do. And so by using AI to try to simulate such a thing, we can make sure that in the future students have this option of having a better experience. 

And so finally, we are turning something like this where the student just clicks through and reveals the instruction at each step into something like this, a more conversation-based and accurate representation of interactions between students and teachers. And so finally, we had the idea of what if we did it in reverse? What if we had given a prompt the Duck giving the student student-like code and questions and the student in response giving advice, which would also be a good way of practice. But it could also be a good way of seeing exactly how much the student understands of what we're trying to teach because when students themselves are able to teach a subject to someone else, we know they've come pretty close to mastery of the subject. 

And so by maneuvering a little bit more, we have this, where this is the same prompt as before, the Python function that checks for the four conditions. And we can ask the duck for its own code. OK, so the AI says, here's my code. And then here's my first attempt. 

So this is a little bit of my prompt engineering not being as good as it should be because I told the Duck to make some common mistakes. And it's admitting right now that it's making common mistakes. Hopefully we can avoid that in the future. But it asks the student, what do you think? 

And the student might take a look. You might think a little bit about how the design of this code can be improved and then in response say, "It looks great. However, I don't think you have to iterate through the entire string once for each condition. You can just do it in one iteration." 

And given this advice, the Duck makes its improvement and says, good point, how about now? And with this, we have an entire log of the Duck's code as well as the student-Duck conversation. And so as a teacher, seeing this would give us a good idea of exactly where our students are. So we're not looking at this as assessment for grades but rather as evaluating students beyond just looking at psets because the thing about CS assessment is it often just sticks to correctness and looking at the final result, which is very important for computer science. 

But in reality, there should be so much more that we're teaching our students. So this is not exactly an evaluation of correctness. But instead it's an assessment of everything else that we value in the teaching process. But with that, I'm basically ending my section of the slideshow. So our hope is that through AI tools we can offer better tools for practice and assessment in CS education. And this is only one chunk of the improvements that we're adding to the AI Duck this summer. And with that, I'll pass it off to Chris to talk about some more features that we're going to be adding. 

CHRISTOPHER PEREZ: Hi, everyone. My name is Chris Perez, and I'm also a summer intern here at CS50. And now we are going to be transitioning away from talking about the OpenAI Completions API. And now we are going to be talking about the Assistants API. So in addition to the Completions API, CS50 is also developing prototypes of educational tools using the Assistants API. 

In a brief nutshell, it is designed for more interactive use cases where the AI can act as a virtual assistant that can both understand and manage multi-turn conversations. For example, it can handle files, maintain individual conversations, or threads as in OpenAI lingo, and can call various tools, which we'll talk about in a brief moment. 

And with all these new features that the Assistants API offers us in comparison to the Completions API, this motivates us using the Assistants API to create custom AI assistants that can handle diverse tasks. For example, it can act as a code reviewing rubber duck for CS50, which we'll see about in a brief moment. Before we proceed, though, I would like to note that this feature is still in beta. So it is still in continuous development. And there are also relatively very few tutorials and documentation. And it can be quite vague. 

But nonetheless, its features can still be utilized. And they are very useful, especially in the educational realm. Yes, so now we are going to be talking about the brief framework and flow of the Assistants API so we understand how it works behind the hood. So we start off with creating an assistant. And in order to create an assistant, we can help define the model that we want to use, its instructions, as well as various tools, for example, like code interpreter, file search, and function calling, which we'll talk about in a brief moment. 

You can also give the model a name, a description, also other features such as a temperature, which, for those unfamiliar, temperature helps control how deterministic an output is. So in the Assistants API, the values for temperature ranges from 0 to 2, where 0 is closer, is more utilized if you want more deterministic outputs. And values closer to 2 make the outputs a lot more random, which could be beneficial in cases where the model would like to produce certain writing assignments, et cetera. 

On the right hand side of the framework, we also have what's called a thread when a user starts a conversation. So in this thread, the thread consists of the user message as well as the assistant message. And here you can also specify tools to be used within the thread. And then you can also store metadata within the thread object. And now we can also create a message and then append it to this specific thread as the user asks questions. And so as we proceed by calling the Assistant API, this thread is continually updated with a chat history that we can refer to later on. 

And then finally, once we append a message to our thread, we can then run the assistant to generate a message and a response by calling the model and using its tools. I would like to note that the assistant also supports streaming, which for those unfamiliar, it helps create a more ChatGPT-like experience, where the words and responses outputted by the assistant are streamed onto the screen in a continuous display. So you can toggle that on or off. 

The assistant API can also support a JSON return format. So you can specify certain parameters in a format that is outputted by the assistant. But I'd like to note that this is only possible when you're not using the code interpreter tool. And then finally, in this run, we are able to store a chat history, which is updated dynamically in the thread to use right now and also for future use later on. 

All right, great. So now we are going to be talking a little bit more about the three tools that the Assistants API offers. So, first, they offer File Search. So File Search enhances the capabilities of the assistants by incorporating knowledge from external sources, such as user-provided documents. It's a bit similar to the embeddings that Rongxin talked about briefly a bit earlier. 

So OpenAI processes these documents by first breaking them into chunks, creating and storing embeddings, and then using both vector and keyword search to find relevant content to answer user questions. And then finally, the file search tool can rank these results to find the most relevant responses before providing the user a response to their question. 

Note that for the File Search tool, however, we must store and later access files from a vector store, which is another API endpoint in OpenAI and then update the assistant to the IDs of the files added to the store. This will make a little bit more sense later on when I give a brief coding demo of how to use the Assistants API. Great. 

And so next up, we are going to be talking about the Code Interpreter, which is the most relevant tool that we are currently experimenting with the Assistants API this summer. So the Code Interpreter tool allows the assistant to process files of a variety of types and formats. And it can also generate files that take in data and images of graphs as inputs. It can also be used to write and run Python code in a sandbox environment and can also run code iteratively to solve code and math problems. 

It is charged at $0.03 per session, and each session lasts about one hour. So after the one hour is done, another session is created if the user still wants to continue using the Code Interpreter tool. And to use the Code Interpreter tool, we attach files. Instead of using the vector store endpoint, we now would use the file upload endpoint. And then throughout the process of creating runs and whatnot, we reference the file ID to attach it to other components of our assistant. 

I would also like to note that we can pass files both to the assistant level and also to the thread level. And the difference is-- so for the attaching files to the assistant level, the files are available to all threads created under that assistant. This may be good for files that want to be referenced throughout all threads, such as course materials more broadly. Or, for example, at the thread level, we can attach files that are just run during that thread. And this may be better used or is very purposeful for individual users and students who attach their own code files to the assistant that they want the assistant themselves to run and interpret. 

A great feature about the Code Interpreter is that it supports a variety of file types such as .c, .doc, HTML, Python, Javascript, CSV, and many more other types of files. And then last but not least, Assistant API also offers a function calling tool which allows us to describe functions to our assistant and have it return the functions that need to be called along with their arguments. 

Now, this is also useful because it allows the model to interact with and run external functions, which expands the Assistant APIs functions beyond text generation and language understanding. All right, so now we will be walking through a little coding demo with Python to see just how we could potentially create an Assistants API. I will note that if you want some more detailed documentation, you can visit the OpenAI walkthrough themselves on their documentation page. 

All right, great. So to begin, we first want to import the OpenAI library and establish a client, which also requires an API key. So before being able to run the assistants, just as in the Completions API, we'll need an API key to be able to run this and then create a personalized assistant. So in this creation for the assistant, we can specify the name of the assistant. So in this case, it is a CS50 assistant. 

We can also specify instructions. So in this case, it is a CS50 debugging Duck. And then we can also tell it other tools it can use, such as Code Interpreter or the other two tools that the assistants can use. And then we can also attach different file ids, giving it specific course content or coding programs that the assistant can run. 

Another note I would like to mention is that when using the code interpreter, I find it very helpful in our experimentations thus far to specify in the instructions themselves to use the code interpreter, the assistant, at the beginning so that it can help avoid hallucinations when referencing that file to generate responses. 

And then we also want to attach files here via the file API, as I mentioned briefly before, which can accept both image URLs and file uploads. And then models that we use for the Assistants API that support vision can also accept image inputs. And currently there is a 100-gigabyte limit for organizations and users in terms of file storage. Great. 

Next, we can create an empty thread with our assistant, which is helpful in maintaining context across a session between a user and an assistant. And just as a great note, there is no limit to the number of messages that we could store in a thread. And then once the size of the messages exceeds the context window of the model, the thread will attempt to smartly truncate messages before dropping the other ones it considers the least important. And then we can also create a thread by passing in a message we want added to it, which can include the file IDs and tools. 

And finally, we store the OpenAI thread object in the variable thread as shown to my left, which we will help reference throughout the student assistant conversation. Next, up we can use the messages.create function to append a new message to an existing thread. For example, here we are adding an initial user message where they can ask for help on the Mario pset. And in this demo, we also show that they are attaching their code and as well as specifying that they want to use the code interpreter tool for this thread. 

And now that we have a message that we want the assistant to respond to, we can now create a run, which takes in both the thread ID and assistant ID that we want to refer to. And then this process returns a run object that can help give us more information on the status of the run and the parameters specified to the thread and assistant. Alternatively, you can use another function that can create a thread and a run in one step. But I'm showing it here in two separate steps so that we can both understand the different types of functions that OpenAI offers us. 

And then finally, after we create our run, we can use a runs retrieve function to check up on the status of our run. We can use the returned object to determine the status of our run, for example, if it's completed, failed, canceled or expired, for example. Once completed, we can retrieve the messages in the run thread and print them out as shown on the screen to my left. Note that this example is not for streaming. But the Assistant AI does allow for streaming, which would change a bit about how we would manipulate the output for the final message, which would require us to use a RunStepDelta object. 

In addition to the code, we can also experiment with the Assistants API in the Assistants Playgrounds on OpenAI's website, which is a bit more user friendly and less daunting because certainly the documentation can be a little vague at times. In terms of pricing, tokens used for the assistant are charged by the specific model that we specify when we create the assistant. 

And also, for example, the Code Interpreter is charged at $0.03 per session, which again lasts for one hour. File Search is charged at $0.10 per gigabyte of vector storage per day, and you get one gigabyte of storage for free. So great. And now with that, I will pass it off to our other intern, Ben, who will be giving us a demo of exactly how we've been able to implement the Assistants API. Thank you so much. 

BENJAMIN XU: Hello, my name is Ben. I'm the last intern here at the CS50 R&D development team. And I'm very excited to be here and present on how we've actually been utilizing the assistant API that Chris has just discussed in some of our new development. Specifically, I'll be looking at student assessment in the age of generative AI, which Julianna and David have touched on a bit. 

All right, so I'm going to go back to the very first slide that you saw today. And that is because it's probably something that's been on many of our minds these past year or two. Powerful tools like ChatGPT are too helpful. And given this, many of us educators especially wonder, do current metrics of assessment accurately measure student mastery? 

For example, is passing every test case in a pset enough to accurately predict that a student fully understands the material or skills of the unit? And to be honest, if you're like me and you witnessed ChatGPT bang out entire repositories of code within minutes, you may be a bit worried about what AI could mean for assessment in CS. And don't get me wrong, AI is an amazing tool that provides millions of students with information that they wouldn't have otherwise. 

But in terms of skills and understanding, it can cause teachers and students alike to be a bit uncertain. So, for instance, many questions have come up recently, the first of being, has AI become a crutch to students? or whether AI has made students lazy or the classic, how well do I really understand the material if I'm using AI to help me? 

And lastly, "will AI hurt me if I use it now in the future?" These questions are by no means easy to answer. AI has exploded onto the education scene faster than we could have imagined. And as great of a resource as AI has been, it has definitely posed a challenge for CS assessments. The classic binary pass or fail test case checks, for example, might not be enough for a teacher or a student to know where they're really at in terms of actual skills. 

So this past year, our in-person teaching staff came together and discussed ways around this. Is there a way to assess students so that both students and teachers can be confident that the students understand the course material? And this is by no means the only solution to this. But one idea that we implemented at Yale this past fall was the code review. 

Now, the code review, it's a common practice in software engineering where, in very simple terms, a programmer will review another programmer's code. It's a very open-ended process. And there can be discussions of anything from design choices to variable naming conventions to even edge case testing. 

So at Yale, we thought, what if each and every one of these 200 to 250 students sat down one on one with a TA and did a code review? If students could thoughtfully articulate their code into English language, then that in itself is a pretty strong demonstration of mastery. So that's actually what we did. 

During our sections, which is just a one-a-week, three-hour-long recitation, halfway through the semester, we chose one problem set per student, usually one that was maybe a bit tougher for them, and took 10 to 15 minutes with each student going through their code. And this was actually quite effective. It was able to tell us that-- it was able to help teaching assistants gauge where students were at, i.e, which concepts of students were very solid on and maybe which that they implemented with a bit of guesswork. 

So from there, both us teachers and students could see that, oh, even though they got all the cases right, they could have used some more practice in this one area or this other area, so a big success. However, there's an immediate problem with this type of assessment, and it's that code reviews are very unscalable. Even if a student can meet with a TA for 15 minutes one time, what if they wanted to meet on a different problem set or every problem set or some problem sets multiple times and review them iteratively? 

This is just really unfeasible for TAs. And what can really help here is the Assistants API. So with the Assistants API Code Interpreter, as Chris mentioned, a model can easily become the TA in a code review. So this is the interface. A student can simply upload any file that they choose. And let's say the student wants to review Mario. 

So they upload Mario. And what happens is their code instantly shows up on the left in a VS Code-like monitor. And on the right is the code review. So let's start by the student getting asked a question. And it says, hello, let's dive into your code review. I see that you've implemented a loop to ensure the user inputs a height between 1 and 8. Can you explain why you chose to use a do while loop specifically? What advantages does it offer in this scenario compared to other loops? 

And the good thing about this is that it actually highlights the code for you in the editor. So you can see in the context of your file where this ddb is really reviewing you on. So let's say you didn't really know why the do while loop was the best choice here, but you knew what your do while loop was doing. So you said I wanted to ask the user for a height once initially but then ask again if the height was invalid. 

So the debugger now reviewing your code will say something along the lines of that's a great explanation. The do while loop works especially well here for some reason. And it also includes a little snippet on why other loops wouldn't be as good. So a while loop would also work, but it would require some unnecessary code. And think about why that would be. 

So then it keeps going. So it highlights the next line and asks you-- I'm sorry, moving on. Can you walk me walk me through the logic of the outer for loop in the nested for loop structure? What role do the variables play here? And let's say you said, i is a counter variable, and n is the height. So you're just counting up to the height. 

And the response is actually pretty insightful. That's right. In your code, i is a counter variable, and n is the height. However, this can be a bit confusing for someone who hasn't read your code before or even yourself if you came back to this after a while. This is why variable naming is so important in programming. Can you think of any better names for these two variables? So now the student can respond with maybe something like current row and height. 

And here the ddb continues the conversation and keeps going. So as you can see, even from two snippets alone, we already have some pretty important things that wouldn't be tested in a tool like check50 or style50. There's one conceptual kink that maybe the student hadn't even thought of and another just very important convention standard that the student was not following. 

So these assessments, like check50 and style50 are great tools. But now with the Assistants API, they're able to get much more nuanced feedback. You may be wondering like, how can these tools actually be used in an assessment. And the way you use such a tool largely depends on your pedagogical beliefs. You've probably noticed that I've been saying that these concerns about AI are not just from teachers but also students. And I myself sometimes question whether I really gained all the skills a class had to offer if an AI had to help me to get to the end of the goal. 

And, of course, the Assistants API could be used, like repurposed, as a TA that grades the students based on their responses. Or the code review can serve as a truly valuable self-assessment tool where the students can gauge their own understanding of their own code and find things they missed all in their own free time without anxiety of judgment. Regardless, this is just one potential application of the Asssitants API to education. And the API is still very new, so I highly encourage you all to make something out of it as well. 

RONGXIN LIU: All right, thank you, Ben, Chris, and Julianna. So that's a lot to take away from this talk. But I want you to end this talk with just a quick takeaway. So what can you all do with all of the technology we just mentioned? You definitely don't need to re-implement what we did. Or you certainly don't have the bandwidth to implement what we did. That's OK. 

Luckily, there's a more user-friendly tool that you can actually utilize, which is the OpenAI GPT. It's a GUI-based GPT builder. You can literally go talk to this GPT builder and build or rebuild or recreate a CS50 Duck right from OpenAI's platform. And you can essentially create this GPT and distribute it to your student to use. 

And for more information on how you can actually build a GPT, I would please look at the link in the slide for more information. But this screenshot basically captured the process. I just quickly was able to create a CS50 Duck on OpenAI's platform. 

And with that, I hope you have a sense of what CS50 AI was actually doing underneath the hood. And I hope you learned something from this talk. If you want to review the talk, certainly you can just rewatch the video recording and then look at the provided code to you as well. With that, I'm going to end the talk here. Thank you so much for your time.