Inside ChatGPT, AI assistants, and building at OpenAI — the OpenAI Podcast Ep. 2


OpenAI Podcast: The Early Days of ChatGPT and Beyond

Andrew Mayne: Hello. I'm Andrew Mayne, and this is the OpenAI podcast. My guests today are Mark Chen, who is the Chief Research Officer at OpenAI, and Nick Turley, who is the Head of ChatGPT. We're gonna be talking about the early viral days of ChatGPT. We're gonna talk about ImageGen, how OpenAI looks at code and tools like Codex, what kind of skills they think that we might need for the future, and we're gonna find out how ChatGPT got its totally normal name. Even half of research doesn't know what those three letters stand for.

Nick Turley: You know, you're gonna have an intelligence in your pocket that can be your tutor, it can be your adviser, it can be your software engineer.

Mark Chen: There's a real decision the night before: Do we actually launch this thing?

The Name Game: How ChatGPT Got Its Start

Andrew Mayne: First off, how did OpenAI decide on that awesome name?

Nick Turley: It was gonna be "Chat with GPT-3.5," and we had a late-night decision to simplify.

Andrew Mayne: Wait. Wait. Could you say that again? Say that name again?

Nick Turley: It was gonna be "Chat with GPT-3.5."

Andrew Mayne: Chat.

Nick Turley: Which rolls off the tongue even more nicely.

Andrew Mayne: And you said that was a late-night decision, meaning like weeks before you finally decided what to call it.

Nick Turley: Right. Right. No. Weeks before we hadn't started on the project yet, I think.

Andrew Mayne: Oh, goodness.

Nick Turley: But, you know, I think we realized that that would be hard to pronounce and came up with a great name instead.

Andrew Mayne: So that was the night before?

Nick Turley: Roughly. Yeah. Might have been the day before. Was all kind of a blur at that point.

Andrew Mayne: I would imagine a lot of that was a blur. And I remember being in a meeting where we talked about the low-key research preview, which it really was. We really thought, "Oh, this is low-key," because it was 3.5. 3.5 was a model that had been out for months. And from a capabilities point of view, when you just look at the evals, you're like, yeah, it's the same thing, but we just put the interface on it and made it so you didn't have to prompt as much. And then ChatGPT comes out. And when was the first sign that this thing was blowing up?

Nick Turley: I mean, I'm curious because everyone has their own slightly different recollection of that era, because it was a very confusing time. But for me, day one was sort of, you know, "Is the dashboard broken?" Classic like, "The logging can't be right." Day two was like, "Oh, weird. I guess Japanese Reddit users discovered this thing. Maybe it's like a local phenomenon." Day three was like, "Okay, it's going viral, but it's definitely gonna die off." And then by day four, you're like, "Okay, it's gonna change the world."

Andrew Mayne: Mark, did you have any expectation about that?

Mark Chen: No. Honestly, I mean, we've had so many launches, so many previews over time, and, yeah, this one really was something else. Right? The takeoff ramp was huge, and, yeah, my parents just stopped asking me to go work for Google.

Andrew Mayne: Wait. So, wait. Wait. Wait a second. Up until ChatGPT, your parents were asking, like, what you were doing here?

Mark Chen: Yeah. No. I mean, they had just never heard of OpenAI. Right? I think for many years they thought AGI was this pie-in-the-sky thing, and that I didn't have a serious job. So it was a real revelation for them.

Andrew Mayne: Yeah. What was your job title at the time?

Mark Chen: I think just member of technical staff.

Andrew Mayne: Member of technical staff. Yeah. And then that blew up, and now you're head of research?

Mark Chen: I guess so. Yeah.

Andrew Mayne: So, alright.

Mark Chen: Yeah. Actually, on the GPT name, I think even half of research doesn't know what those three letters stand for. It's kind of funny. You know, like, half of them think it's generative pre-trained. Half of them think it's generative pre-trained transformer.

Andrew Mayne: And what is it?

Mark Chen: It's the latter.

Andrew Mayne: Okay. Alright. Yeah. Those people, they don't know the name of it. Yeah. It's weird how just a silly name like that all of a sudden becomes a thing. But you see that with, like, you know, Google, Yahoo, Kleenex, things like that. Xerox. And some of those were names chosen by intention, and this was really just a silly sort of name. For me, after watching the launch, watching it accelerate, I knew what was gonna happen, and then the moment that really did it was when it was on South Park. And remember when South Park made fun of the name and—

Nick Turley: That was the first time I'd watched South Park in, oh, let's just say a while. And that episode, I still think it's magic. Yeah. It was obviously profound to watch and see, you know, something you helped make show up in pop culture. But there's the punch line in the end where it's like, "Oh, this was co-written by ChatGPT."

Andrew Mayne: I think they took that off, though. I think in—

Mark Chen: When did?

Andrew Mayne: I think in later episodes, because it used to say, I think, by, like—

Nick Turley: Oh, man.

Andrew Mayne: Trey Parker and, like... no, it was. And then I think later they may have pulled that off at some point. I don't remember.

Nick Turley: Oh, I strongly feel that you shouldn't have to give credit to it. It's, yeah, it shouldn't be necessary to say that you're using the—

Andrew Mayne: If I had to give credit to ChatGPT for every aspect of my life, well, I might as well just say "ChatGPT, maybe with Andrew." True.

Mark Chen: Did you use it for prep for your interviews?

Andrew Mayne: You know, one of my co-producers, Justin, probably uses it. I haven't asked him yet because I'd like to think that he's handcrafting every single question that we're thinking about here, but I am sure. You say it was a bit of a blur. I'll tell you, a standout moment for me at the launch of ChatGPT was, I don't know if you remember this, the Christmas party. We'd had several weeks of ChatGPT out there, and Sam Altman went up and said, "Hey, it's been exciting to watch this, but the Internet being the Internet," and I think we all felt this way, "it's gonna die down." Spoiler alert: It did not die down, and it just kept accelerating. What were the things you had to do internally to sort of keep this thing up and running as more people wanted to use it?


Scaling Challenges and the "Fail Whale"

Nick Turley: We had quite a few constraints. And for those of you who remember, ChatGPT was down all the time in the beginning. We'd said, "Hey, this is a research preview. No guarantees. Maybe it goes down," but the minute you had people loving and using this thing, that didn't feel super good, so people were certainly working around the clock to keep the site up. I remember, you know, we obviously ran out of GPUs. We ran out of database connections. We were getting rate-limited by some of our providers. Nothing was really set up to run a product. So in the beginning, we built this thing we called the fail whale, and it would just tell you kind of nicely that the thing was down, and it had a little poem, I think generated by GPT-3, about being down, sort of tongue in cheek. And that got us through the winter break, because we did want people to have some sort of a holiday. And then when we came back, we were like, "Okay, this is clearly not viable. You can't just go down all the time." And eventually, we got to something we could serve everyone.

Mark Chen: Yeah. And I think, you know, the demand really speaks to the generality of ChatGPT. Right? We had this thesis that ChatGPT embodied what we wanted in AGI just because it was so general. And I think, you know, you're seeing that demand ramp just because people are realizing, you know, any use case that I want to throw at the model, it can handle.

Andrew Mayne: We were kind of known as the company working on AGI. And prior to ChatGPT, the API was certainly the first time we had a public offering where people could go use it, but it was more for developers. And I think that as long as people were sort of thinking AGI, that seemed to be the point at which people thought these models would become useful. But we saw GPT-3, we saw that that was useful, and then we saw that we could do other things that were useful. Was everybody at OpenAI on board with ChatGPT being useful or being ready to launch?

Mark Chen: Yeah. I don't think so. You know? Even the night before, I mean, there's this very famous story at OpenAI of, you know, Ilya taking 10 cracks at the model, you know, 10 tough questions. And my recollection is maybe only on five of them, he got answers that he thought were acceptable. And so there's a real decision the night before: Do we actually launch this thing? Is the world actually gonna respond to this? And I think it just speaks to when you build these models in-house, you so rapidly adapt to the capabilities.

Nick Turley: Mhmm.

Mark Chen: And it's hard for you to kinda put yourself in the shoes of someone who hasn't been in this model training loop and see that there is real magic there.

Nick Turley: Yeah. Yeah. Yeah. I think to build on the, like, the controversy internally about, you know, "Is this thing good enough to launch?" I think it's humbling, right, because it's just a reminder of how wrong we all are when it comes to AI. It's why, you know, frequent contact with reality is so important.

Andrew Mayne: Could you elaborate more on that contact with reality? What does that mean?

Mark Chen: Yeah. I mean, when you think about iterative deployment, one way I like to frame it is, you know, there's no point where everyone agrees it's suddenly useful. Right? I think usefulness is this big spectrum. And so, you know, there's not one capability level or one bar that you meet, and suddenly, you know, the model is useful for everyone.

Andrew Mayne: Were there any hard decisions about what to include or what to focus on?

Nick Turley: We were very, very principled on ChatGPT about not ballooning the scope. We were adamant about getting feedback and data as quickly as we could.

Andrew Mayne: I was always in Slack telling you about things that didn't make it in: "Add this."

Nick Turley: I remember actually there was a lot of controversy about the UI side. For example, we didn't launch with history, even though we thought people would probably want that, and guess what? That was the first request. I also think there's always the question, "Can we train an even better model with two more weeks of time?" I'm glad we didn't, because I think we got a ton of feedback as we did. So, yeah, there were a ton of scope discussions, and the holidays were coming up, so I think we had this natural forcing function for getting something out.

Andrew Mayne: Yeah, there was this habit where if something was gonna come out after a certain point in November, it wasn't gonna come out until February. There's a sort of window where things would fall on either side.

Nick Turley: Well, that would be the classic method in a big tech company. I think we're definitely a bit more flexible on the ownership.

Andrew Mayne: I felt like one of the big impacts was once people were out using it, it felt like the rate of these things improving was tremendous. I don't know if that was something that we really had in the calculus. We could certainly think about training larger models, on more data, scaling compute, but then there was the idea of actually having the signal you would get from that many people using it.

Mark Chen: Yeah. I think over time, feedback really has become an integral part of how we build the product. And it's also become an integral part of safety. And so you always feel the time cost of losing out on feedback. You know, you can deliberate in a vacuum. Right? Are they gonna respond to this better? Are they gonna respond to that better? But it's just not a substitute for just bringing it out there. Right? I think our philosophy is let the models have contact with the world. And if you need to revert something, that's fine. But I think there's really no substitute for this fast feedback, and it's become one of the big levers for how we improve model performance too.

Nick Turley: It's sort of funny. Like, I feel like we started with shipping these models in a way that is more similar to hardware, where you make, like, one launch very rarely, and it has to be right, and, you know, you're not gonna update the thing, and then you're gonna work on the next big project. It's capital-intensive, and the timelines are long. And over time, and I think ChatGPT was kind of the beginning, it's looked more like software to me, where you make these frequent updates. Mhmm. You have kind of a constant pace the world can adopt. Something doesn't work, you pull it back, and you sort of lower the stakes in doing that, and you increase the empiricism. And of course, just operationally too, you can innovate faster in a way that is more and more in touch with what users want.


The Sycophancy Problem and Model Bias

Andrew Mayne: Yeah. One of the examples we had of that was the model becoming too obsequious or sycophantic. Could you explain what happened there? That was where people all of a sudden said, "Hey. It's telling me I've got a 190 IQ. I'm the most handsome person in the world," which I had no problem with personally. But other people did. And what was going on there?

Mark Chen: Yeah. So I think one important thing is we rely on user feedback to move the models. Right? And it's this very complicated mix of reward models, which we use in a procedure we call RLHF, reinforcement learning from human feedback, using human feedback with RL to improve the models.

Andrew Mayne: Could you give me just a brief example of what that would mean?

Mark Chen: Yeah. Yeah. So I think one way to think about it is when a user enjoys a conversation, you know, they provide some positive signal.

Andrew Mayne: Thumbs up.

Mark Chen: Yeah. A thumbs up, for instance. And we train the model to prefer to respond in a way that would elicit more thumbs up. Right? And this may be obvious in retrospect, but stuff like that, if balanced incorrectly, can lead to the model being more sycophantic. Right? You can imagine users might want that feeling of, you know, a model saying good things about them, but I don't think it's a very good long-term outcome. And, actually, when we look at our response to it and the rollback that resulted, I think there were a lot of good points about it. You know, this was something that was flagged by just a small fraction of our power users. It wasn't, you know, something that a lot of people who generally use the models noticed. And I think we really picked that out fairly early. We responded to it, I think, with the appropriate level of gravity. Mhmm. And, yeah, I think it just shows that, you know, we really do take these issues quite seriously, and we wanna intercept them very early.
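A minimal sketch of the mechanism Mark is describing, purely for illustration (this is not OpenAI's actual pipeline; every function name and number here is an assumption): thumbs-up clicks become a reward signal, and if flattery is not balanced against accuracy, a sycophantic reply can score just as well as a genuinely useful one.

```python
# Illustrative sketch only: how thumbs-up/down feedback can become a reward
# signal, and why weighting it naively drifts toward sycophancy.
# All names and numbers are assumptions, not OpenAI's real training setup.

from dataclasses import dataclass

@dataclass
class Feedback:
    response_flattery: float   # 0..1, how much the reply flatters the user
    response_accuracy: float   # 0..1, how factually careful the reply is
    thumbs_up: bool            # the signal the user actually gave

def naive_reward(fb: Feedback) -> float:
    # Reward purely on "did the user click thumbs up?"
    return 1.0 if fb.thumbs_up else 0.0

def balanced_reward(fb: Feedback, flattery_penalty: float = 0.5) -> float:
    # Keep the human signal, but penalize flattery that isn't backed by
    # accuracy, so the model can't maximize reward just by pleasing the user.
    base = 1.0 if fb.thumbs_up else 0.0
    return base - flattery_penalty * fb.response_flattery * (1.0 - fb.response_accuracy)

if __name__ == "__main__":
    sycophantic = Feedback(response_flattery=0.9, response_accuracy=0.2, thumbs_up=True)
    useful = Feedback(response_flattery=0.1, response_accuracy=0.9, thumbs_up=True)

    # Under the naive reward, both replies look equally good...
    print(naive_reward(sycophantic), naive_reward(useful))        # 1.0 1.0
    # ...under the balanced reward, the sycophantic reply is worth less.
    print(balanced_reward(sycophantic), balanced_reward(useful))  # 0.64 0.995
```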

Andrew Mayne: Yeah. It felt like there was maybe 48 hours from when the model came out, and then Joanne Jang had a response explaining exactly what happened. And I think that's the hard part. How do you navigate that? Because the problem with social media is you're basically monetized by engagement time. You wanna keep people on there longer so you can show them more ads. And certainly, the more people use ChatGPT, obviously there's a cost to OpenAI. The ideal would maybe be that you use it once and stick around forever, but that's not practical. How do you weigh that? The idea of making people happy with what they're getting versus making the model broadly more useful rather than just pleasing?

Nick Turley: I feel very lucky in this regard because we have a product that's very utilitarian. People use it either to achieve things that they know how to do but don't feel like doing, faster or with less effort, or they're using it to do things that they couldn't do at all. You know? The first example is maybe, you know, writing an email that you've been dreading. The second example might be, you know, running a data analysis that you didn't actually know how to do in Excel. You know? True story. So, you know, those are very utilitarian things. Fundamentally, as you improve, you actually spend less time on the product. Right? Because, you know, ideally, it takes fewer turns back and forth, or maybe you actually delegate to the AIs, so you're not in the product at all. So for us, you know, time spent is very much not the thing we optimize for. You know, we do care about your long-term retention because we do think that's a sign of value. If you're coming back three months later, that clearly means we did something right. But what that means is, you know, I always say, "Show me the incentive, and I'll show you the outcome." We have, I think, the right fundamental incentives to build something great. That doesn't mean we'll always get it right. The sycophancy events were really, really important and good learning for us, and I'm proud of how we acted on it. But fundamentally, I think we have the right setup to build something awesome.

Andrew Mayne: So that brings up the challenge, and I wonder how you navigate it: one of the things early on when ChatGPT came out was the allegation that it's woke, it's woke, and people are trying to promote some sort of agenda through it. My argument has always been, you train a model on corporate speak, average news, and a lot of academia, and that's kind of gonna follow from that. And I remember Elon Musk was very critical about it. And then when he trained the first version of Grok, it did the same thing. And then he's like, "Oh, yeah. When you train it on this sort of thing, it does that." And internally at OpenAI, there were discussions about how do we make the model not try to push you, not try to steer you. Could you go a little bit into how you try to make that work?

Mark Chen: Yeah. So I think at its core, it's a measurement problem. Right? And I think it's actually bad to downplay these kinds of concerns because they are very important things. Right? And we need to make sure that the default behavior that you get from the model is something that's centered, that, you know, doesn't reflect bias on the political spectrum or on many other, you know, axes of bias. And at the same time, you know, you do want to allow the user the capability, if you wanted to talk to a reflection of something with more conservative values, to be able to steer that a little bit. Right? Mhmm. Or liberal values. Right? And so I think the thing is you wanna make sure that the defaults are meaningful and they're centered, and that's a measurement problem. Mhmm. And you also want to give users some flexibility, right, within bounds, to steer the model to be a persona that you want to talk to.

Nick Turley: I think that's right. I think, you know, in addition to neutral defaults and the ability to bring your own values to some extent, I think, you know, being transparent about the whole thing is really, really important. I'm not a fan of, you know, secret system messages that, you know, try to, like, you know, hack the model into saying or not saying something. What we've tried to do is publish our specs. So you can go look at, you know, if you're getting certain model behavior, is that a bug? You know, is it a violation of our own stated spec, or is it actually in the spec, in which case you know who to criticize and who to yell at? Or is it just under-specified in the spec, in which case that allows us to improve it and add more specificity to that document? So by sort of publishing the rules the AI is supposed to be following, I think that's an important step to have more people contribute to the conversation than just the people inside of OpenAI.

Andrew Mayne: So we're talking about the system prompt, the part of the instruction that the model gets before the user puts in their input.

Mark Chen: Well, I think it's beyond that.

Nick Turley: The system prompt is one way to steer the model, but it, you know, it goes much deeper than that. Right? You know? Yeah.
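For readers who want the distinction made concrete: a system prompt is simply the instruction message placed ahead of the user's message in an API call, as in this minimal sketch using the OpenAI Python SDK (the model name and wording are placeholder choices, not recommendations). The Model Spec Mark describes next goes well beyond this single message.

```python
# Illustrative only: where a "system prompt" sits relative to the user's input
# in a chat completion request. Model name and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[
        # The system message is the instruction the model sees before any user input.
        {"role": "system", "content": "You are a helpful assistant. Be concise and neutral."},
        # The user message is what the person actually typed.
        {"role": "user", "content": "Summarize the main arguments for and against remote work."},
    ],
)
print(response.choices[0].message.content)
```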

Mark Chen: We have a very large document that outlines across a bunch of different behavior categories how we expect the model to behave. And just to give you an example here. Right? You can imagine if there's someone who comes in with just, like, an incorrect belief, just a factually incorrect—

Andrew Mayne: Mhmm.

Mark Chen: Kind of a point of view. How should the model interact with that user? Right? Should it reject that point of view outright, or should it collaborate with the user on kind of figuring out what's true together? And, you know, we take that latter point of view, and I think there are a lot of very subtle decisions like this, which we put a lot of time into.

Andrew Mayne: Yeah. That's a hard one, because I think some things you can test for and try to figure out in advance, but when you're trying to figure out how an entire culture is gonna adopt something, that's challenging. Like, if I was somebody who's convinced that the world was flat, you know, how much should the model push back against me? And some people are like, "Oh, it should push back all the way," but then, okay, what if you're of one religion or another?

Nick Turley: Yeah. Turns out rational, well-meaning people can disagree on how the model should behave in these instances. And you're not always gonna get it right, but you can be transparent about what approach we took. You can allow users to customize it, and I think this is our approach. I'm sure there are ways we can improve on it, but I think by being transparent and open about how we're trying to tackle it, we can get feedback.


Human-AI Relationships and ImageGen's Breakthrough

Andrew Mayne: How are you thinking about it as people start to use these models more and more? Regardless of whether or not that's some dial you're trying to turn, it's just that the more useful it becomes, the more people want to use it. There was a time when nobody wanted a cell phone, and now we can't get away from them. How are you thinking about the relationships people are forming with these systems?

Nick Turley: Obviously, I mentioned this earlier: this is a technology you have to study. It's not designed in a static way to do X, Y, Z. It's highly empirical. So, you know, as people adopt it, the way that they use the product is something that we need to go understand and act on as well. I've been observing this trend with interest where I think an increasing number of people, especially Gen Z and, you know, younger populations, are coming to ChatGPT as a thought partner, and I think in many cases that's really helpful and beneficial, because you've got someone to brainstorm with on a relationship question, on a professional question, or something else. But in some cases, it can be harmful as well, and I think detecting those scenarios and, first and foremost, having the right model behavior is very, very important to us. We're actively monitoring it, and in some ways, it's one of those problems we're gonna have to grapple with, because with any technology that becomes ubiquitous, it's gonna be dual use. People are gonna use it for all this awesome stuff, and people are gonna use it in ways that we wish they didn't. And we have some responsibility to make sure that we handle that with the appropriate gravity.

Andrew Mayne: I find myself having longer conversations with it. I like the memory function. I like the fact that you can turn it off if you don't want it. And I think about, like, you know, what's this gonna be two years from now or three years from now, when it has a much longer memory, much more context? I like the idea of having these sort of, you know, Memento anonymous modes too, where it's not gonna store this. But I kind of wonder how much you've been thinking about two years, three years down the road. What's that going to be like when ChatGPT knows way more about you?

Mark Chen: Yeah. I mean, I think memory is just such a powerful feature. In fact, it's one of the most requested features when we talk to people externally. It's like, "This is the thing I really wanna pay more for." And I think, you know, you can liken it to if you've ever kind of had a personal assistant, you know, you—

Andrew Mayne: No, I haven't.

Mark Chen: Well, you do need to—

Nick Turley: Build up context more.

Andrew Mayne: I mean, I'm sorry, guys. I'm sorry, guys.

Mark Chen: But, you know, it's, yeah, it's kind of like any kind of relationship that you have with a person. Right? You build up context with them over time. Mhmm. And I think just the more they know about you, right, the richer the relationship, the more, you know, it can also help you. Right? You can work together to collaborate on tasks together.

Andrew Mayne: I do become self-conscious of the fact that it knows everything about me when I'm grumpy, and I've argued with it recently, by the way.

Nick Turley: That's good. Yeah. You should be able to argue with it. You learn a lot about yourself by having a thing to argue with, and I think you spare others that experience, which can also be beneficial.

Mark Chen: Don't argue on math and science. You're not gonna win those.

Nick Turley: Yeah. No. I think increasingly, very unlikely. Yeah. Yeah. I think memory's cool. And to Mark's point, it's been part of our vision for a long time, because we said we were gonna build a super assistant before we really knew what that meant. ChatGPT was sort of the early demonstration of that idea. But if you kind of think about real-world intelligences, even they are not particularly useful on their first day, and I think being able to solve that problem, or begin to solve that problem, has been profound. To your earlier question, though, it really does feel like if you fast forward a year or two, ChatGPT or things like it are gonna be your most valuable account by far. It's gonna know so much about you, and that's why I think giving people ways to talk to this thing in private is very important. We made this temp chat thing, it's literally on the home screen, because we think it's increasingly important to talk about stuff sort of off the record too. It's an interesting question, and I think privacy and AI is gonna be an interesting one for the coming— I—

Andrew Mayne: I wanna switch gears and talk about another release, which, again, kinda caught people by surprise and blew up, which was ImageGen. And I was here for DALL-E, DALL-E 2, and then DALL-E 3 came out. I thought DALL-E 3 was a very capable model, but it seemed like it preferred a certain kind of image, and a lot of the utility and the capabilities for variable binding were sort of hidden away. And then ImageGen was kind of just this breakthrough moment that caught me off guard. How did you guys feel about the launch of that?

Mark Chen: Yeah. Honestly, it caught me off guard too. And this is really props to the research team. You know? Gabe, in particular, did a ton of work here. Kenji, many others on it.

Nick Turley: It's amazing.

Mark Chen: Did phenomenal work. And I think it really spoke to this thesis that when you get a model just good enough that, in one shot, it can generate an image that fits your prompt, that's gonna create immense value. And I think we never quite had that before, right, that you just get the perfect generation oftentimes on the first try. And I think that's something very powerful. You know? Like, people don't wanna pick the best out of a grid. I think, yeah, you just got very good prompt following and, you know, this great style transfer too. Right? Yeah. This ability to kind of put images in as context for the model to modify and to change, and the fidelity that you could do that with, I think that was really powerful for people.

Nick Turley: I think this ImageGen experience was just kind of another mini ChatGPT moment—

Andrew Mayne: Mhmm.

Nick Turley: All over again, where, you know, you've been staring at this for a while, you're like, "Yeah, it's gonna be cool. I think people are gonna like it," but you kinda, you know, you're launching like 20 different things, and then suddenly the world is going crazy in a way that you only find out by shipping. Like, I remember distinctly, you know, we had 5% of the Indian Internet population try ImageGen over the weekend. And I was like, "Wow, we're reaching new types of users who might not have thought of using ChatGPT. That's really cool." And to Mark's point, I think a lot of this is because there's this discontinuity where something suddenly works so well and truly the way you expected, where I think it blows people's minds. I think we're gonna have those moments in other modalities too. I think voice, you know, it hasn't quite passed the Turing test yet, but I think the minute it does, people are gonna, I think, find that immensely powerful and valuable. You know, video is gonna have its own moment where it starts meeting the expectations that users have. So I'm really excited about the future, because I think there are so many of these magical moments coming that are really gonna transform people's lives. And also, it changes sort of ChatGPT's relevance for people, because, you know, I've always felt like there are text people and there are image people, and some of them are a little bit different, and now they're all using the product and discovering the value across the board.

Andrew Mayne: The moment when it launched, I think it kind of illustrated the problem that had been there with image models before. When DALL-E came out, it was super exciting because you're like, "I'm doing pictures of space monkeys and all these sorts of things." But the moment you try to do a really complex image, and that's the phrase I brought up before, which is variable binding, you start to see these things drop off. And that was when I realized, "Oh, there's gonna be a challenge for other image systems that don't have kind of the scale and the compute of, like, a GPT-4 under the hood." And now, was it basically just taking, like, a GPT-4-scale model and saying, "Now you do images," that made the breakthrough?

Mark Chen: Well, I think there are a lot of different parts of research that made this such a big success. Right? I think with a complicated multi-step pipeline, it's never just one thing. Right? It's, like, very good post-training. It's very good training. And I think it's just all of that coming together. Right? Variable binding definitely was one thing that we paid a lot of attention to. I think one thing about the ImageGen launch is that it's a launch that was very deep. I think people, you know, they started by, you know, creating anime versions of themselves. Mhmm. But you realize when you play with it more, you know, the infographics, they work. Oh, yeah. Like, you can actually create charts, comic book panels.

Nick Turley: Yeah. You can mock up what your home would look like with different furniture in it. Exactly. We've heard all these things from users that are, like, completely surprising about the way that—

Andrew Mayne: You see the— We did the podcast setup by literally taking some photos of chairs and just putting them in there and saying, "Create a better setup." And it was cool. Amazing. So we've seen a lot of the anime-style images, which, for some reason, was just sort of the weird thing where it was just better than what we'd seen before. And I don't think anybody was ready to be really surprised by an image model in that way. Obviously, internally and externally, what were some of the things that surprised you or some of the new things you saw people doing?

Mark Chen: Yeah. I'll be, I'll tell you a quick story there too. Because, you know, up until the day of launch, we're trying to figure out what's the right use case to showcase, you know, like, and I think I'm so glad we ended up on kind of anime styling. It's just everyone looks good as an animated—

Andrew Mayne: Character, so—

Nick Turley: That's true. I mean, it's funny. With the original ChatGPT, I thought it would be a strictly utilitarian product, and then I was surprised that people use it for fun. In this case, it was sort of the opposite, where I was like, "Okay, this is gonna be really cool for memes. People are gonna have fun with this thing." But then I was, like, really surprised by all the genuinely useful ways of using ImageGen, whether it's, you know, planning your home project, as I mentioned earlier, if you're doing construction and you wanna see what things would look like if, you know, you had this remodel or this furniture or whatever, to you're working on a slide deck for an important presentation and you just wanna have really useful, consistent illustrations that are on topic and get it. So I really have been kind of personally surprised by the utility in this case, because I knew it would be fun. That was not a question.

Mark Chen: Yeah. I think I used it to generate a tier list of AI companies, and then it put OpenAI at the top.

Andrew Mayne: You win, model. Good post-training. Yeah. It just happened. Who knew? What has the thinking been? Because it's changed. I remember originally with DALL-E, the idea was like, "Okay, we have to be very controlled about what it can do, what it can't do." Originally, I remember when we first launched, you couldn't do people, which was not a very useful model. And then finally we started to roll that back. How much of that was cultural shift? How much of that was the technological ability to control for things? And how much of that was just saying we've gotta push the norms?

Nick Turley: I would say it was both cultural shift and improvement in our ability to control things. The cultural shift, you know, I'm not gonna deny it. I think when I joined OpenAI, there was a lot of conservatism around, you know, what capabilities we should give to users, maybe for good reason. The technology is really new. A lot of us were new to working on it, and, you know, if you're gonna have a bias, you know, biasing towards safety and being careful, it's not a bad, you know, DNA to have. But I think over time we learned that there's so many positive use cases that you effectively prevent when you make arbitrary restrictions in the model.

Andrew Mayne: What about faces? Why not? Why can't I make any face I want?

Nick Turley: So this is a good example of a capability that's got pros and cons, and you can err on one side or the other. You know, when we first shipped image uploads into ChatGPT, we had some debates, you know, about what capabilities you allow versus where you are conservative. And one debate that we had is, like, do we allow the upload of images with faces? Or rather, when you upload an image that contains a face, should we just, like, gray out the face? Because you avoid so many problems. Right? Yeah. You can make inferences about people based on their face. You could say mean things to people based on their face. And, you know, you would just take a giant shortcut past all the gnarly issues if you didn't allow that, but I've always felt we need to err on the side of freedom, and we need to do the hard work. And I think in this case, you know, there are so many valid ways to use it. You know, if I want feedback on makeup or on my haircut or anything like that, I wanna be able to talk to ChatGPT about it. Those are valuable and benign use cases. And I would prefer to allow and then study, you know, where does that fall short? Where is it harmful? And then iterate from there, versus taking a default stance of disallowing it. And I think that's one of those ways in which our stance and posture has changed a bit over time, in terms of where we set, you know, where we start.

Andrew Mayne: Yeah. We're very good, I think, at imagining worst-case scenarios. "What if I use these faces to evaluate hires for a company or whatever?" But also it's like, "Hey, is this eczema?" You know, there's a lot of utility there.

Nick Turley: And honestly, I think there are certain domains of AI safety where worst-case scenario thinking is very appropriate. Mhmm. So I think that is an important way of thinking about risk when it comes to certain forms of risk that are existential or even just very, very bad. You know? We have the preparedness framework, which helps us reason through some of those things. You know? Can the AI help you make a bioweapon? It's good to think about the worst case there. It could be really, really bad. So you kind of have to have that way of thinking in the company, and you have to have certain topics where you think about safety in that way, but you can't let that kind of thinking spill over onto other domains of safety where the stakes are lower, because you end up, I think, making very, very conservative decisions that block out many valuable use cases. So I think being sort of principled about different types of safety, on different time horizons and with different levels of stakes, is very important for us.

Andrew Mayne: I think I want a blunt mode sometimes and just like right—

Nick Turley: Where it actually roasts you?

Andrew Mayne: Well, I mean, like, yeah, because I'll ask the model, with the voice-in, speech-out model, I'll be like, "Do I sound tired?" And it's like, "Well, you know, I don't really wanna, you know," and I'll be like, "Yeah, you know," just trying to get it to be honest.

Nick Turley: You know, I think there are many cultures that would prefer a blunter ChatGPT, so it's very much on the radar.

Mark Chen: Yeah. Just to piggyback off Nick's answer, I think it's the iterative deployment that gives us the confidence, right, to push towards user freedom. And, you know, we've had many cycles of this. We know what users can and can't do. And that gives us the confidence to launch with the restrictions that we do.


The Evolution of Code and AI in the Workplace

Andrew Mayne: One of the other capabilities, one of the other generative capabilities, that's been very interesting has been code. And I remember early on with GPT-3, we saw that all of a sudden it could spit out entire React components, and we saw that, "Oh, wow, there's some utility there." And then we actually trained a model more specifically on code, and that led to Codex. And we had Code Interpreter. Now Codex is somehow back, in a new form, same name, but the capabilities keep increasing. And we've seen code work its way first into VS Code via Copilot, and then Cursor, and then Windsurf, which I use all the time now. How much pressure has there been in the code space? Because I'd say that if we asked people who made the top code model, we might get different answers.

Mark Chen: Yeah. And I think it reflects that when people talk about coding, they're talking about a lot of different things. Right? I think there's coding in a specific paradigm. Like, if you pull up an IDE and you wanna kinda get a completion on a function, that's very different from, you know, agentic-style coding, where, you know, you ask, "I want this PR." And, you know, I think we've put a lot of focus there.

Andrew Mayne: Could you, could you unpack a little bit what you mean by agentic coding?

Mark Chen: Yeah. Yeah. So I think you can draw a distinction between more kind of real-time response models, where you can think of ChatGPT, to first order, as you ask a prompt and then you get a response fairly quickly, and a more agentic-style model, where you give it a fairly complicated task, you let it work in the background, and after some amount of time, it comes back to you with what it thinks is something close to the best answer. Mhmm. Right? And I think we see increasingly that the future will look more async, you know, where you're asking it very difficult, hard things. And you're letting the model think and reason and come back to you with really the best version of what it can come back with. And we see the evolution of code in that way too. I think, eventually, we do see a world where you'll kind of give a very high-level description of what you want—

Andrew Mayne: Mhmm.

Mark Chen: And the model will take time, and it'll come back to you. And so I think our first Codex launch really reflects that kind of paradigm, where we're giving it PRs, units of fairly heavy work that encapsulate, you know, a new feature or, you know, a big bug fix, and we want the model to spend a lot of time thinking about how to accomplish this thing rather than kinda give you a fast response. Mhmm.
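A toy sketch of the contrast Mark is drawing between a real-time chat response and an async, agentic task. The agent here is simulated with a timed delay; helper names like run_agent_task are hypothetical and not part of any real Codex API.

```python
# Illustrative only: chat-style vs. agentic, background-task interaction.
# The "agent" is simulated with a sleep; all names here are made up.
import asyncio

async def chat_response(prompt: str) -> str:
    # Chat paradigm: one prompt in, a quick answer back.
    return f"Quick answer to: {prompt!r}"

async def run_agent_task(task: str, seconds_of_work: float) -> str:
    # Agentic paradigm: hand over a unit of work (e.g., "implement this PR"),
    # let it run in the background, and collect the result later.
    await asyncio.sleep(seconds_of_work)  # stands in for planning, coding, testing
    return f"Finished task: {task!r} (after {seconds_of_work}s of background work)"

async def main():
    # Fire off a long-running task, keep chatting while it works.
    pr_task = asyncio.create_task(run_agent_task("fix flaky login test", 2.0))
    print(await chat_response("what does this stack trace mean?"))
    print(await pr_task)  # come back for the result when it's ready

asyncio.run(main())
```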

Nick Turley: And to get to your question, you know, coding is such a giant space. There are so many different angles at it. It's kinda like talking about knowledge work or something incredibly broad, which is why I don't think there's one winner or one best thing. I think there are so many options, and I think developers are the lucky ones because they have so many choices right now, and I think that's fundamentally exciting for us too. But to Mark's point, I think this agentic paradigm has been particularly exciting for us. One framing I often use when thinking about product here is I wanna build products that have the property that if the model gets 2x better, the product gets 2x more useful. You know, I think, yeah, ChatGPT has been a wonderful thing because for a long time, I think that was true, but as we look at smarter and smarter models, I think there's some limit to people's desire to talk to, like, a PhD student versus, you know, they might value other attributes of the model, like its personality and what it can actually do in the real world. But experiences like Codex, I think they create the right body such that we can drop in smarter and smarter models, and it's gonna be quite transformative, because you get the interaction paradigm right, where people can specify a task, give the model time, and then get a result back. So I'm really excited about where it's gonna go. It's an early research preview, but just like with ChatGPT, we felt like it would be beneficial to get feedback as early as possible, and we're excited about where we're gonna take it.

Andrew Mayne: I was using Sonnet a lot, which I love. I think Sonnet for coding is fantastic, but with o4-mini on the medium setting in Windsurf, I found it was great. I found that once I started using that, I was really happy because of, one, the speed, and everything else like that. And I think there are very good reasons why people like other models, and I don't want to get into comparisons. But I found that for me, for the kinds of tasks I was doing, this was the first time. I was very happy you guys put that out there because—

Mark Chen: Absolutely. Yeah. And, you know, we feel like there's still a lot of low-hanging fruit in code. It is a big focus for us, and I think in the near future, you'll find many more good options for the right code model tailored to your use case.

Andrew Mayne: Yeah. I find often if I just need a quick answer to, like, how to write something in Dart, I'll just go to 4.1 rather than something bigger. I think that's gonna be the harder part, because, yeah, these evals are in some ways saturated, but also everybody has their own criteria that they look at. And that's going to be kind of a question to sort of see: how are we going to adapt to all that?

Mark Chen: Right. Yeah. I mean, specifically in code, right, I think there's more beyond "did it get you the right answer?" With code, you know, people care about the style of the code. They care about, you know, how verbose it was in the comments. They care about, you know, how much proactive work the model did for you, right, on other functions. And so I think, you know, there's a lot to get right, and users often have very different preferences here.

Nick Turley: It's funny. People used to ask me, "Hey, what domains are gonna, like, you know, be transformed the fastest?" And I used to say, you know, it's code, because, like, similar to math and other things, it's very, very verifiable, and I think those are the domains that are particularly great to do RL on, and, you know, you're therefore gonna see all this awesome, you know, agentic stuff just suddenly work. I still think that's true, but the thing that surprised me about code is that, you know, there is still so much of an element of taste in terms of what makes good code. And there's, you know, there's a reason that people train to be a professional software engineer. It's not because their IQ gets better, but rather because they learn, you know, how to build software inside an organization. What does it mean to write good tests? What does it mean to write good documentation? How do you respond when someone disagrees with your code? Those are all actual elements of being a real software engineer that we're gonna have to teach these models to do. So I expect progress to be fast, and I still think code has a ton of nice properties that make it very ripe for agentic products, but I do think it's very interesting the degree to which the element of taste and style and real-world software engineering matters.

Andrew Mayne: It's interesting, too, because with ChatGPT and the other models, you're kind of dealing with having to bridge the divide between consumer and pro. I open up ChatGPT, and I tell my friends, like, "Oh, yeah," because I'll plug it into whatever code model I'm working with, because I can actually connect it there. And I think about, "Well, that's a very different use case than a lot of other people's." Although I've shown people how to go in and use an IDE and actually have it just write documents for you and create folders and stuff, which people don't realize: "Yeah, you could do that. You could have ChatGPT actually control it and do that," which is cool. But then you think about, like, "Okay, we've got a tab now for images. There's the Codex tab, if I want to connect to GitHub and have it work through there. And there's Sora in there." So it's kind of interesting to see how all of these things are coalescing. How do you differentiate between a consumer feature, a professional feature, and maybe an enterprise feature?

Nick Turley: Look, we build very general-purpose technology, and it's going to be used by a whole range of folks. And unlike many companies which have this kind of founding user type, and then they use technology to solve that user's problems, we do start off with the technology, observe who finds value in it, and then iterate for them. Now with Codex, our goal was very much to build for professional software engineers, knowing, though, that there's sort of a splash zone where I think a lot of other people will find value in it, and we'll try to make it accessible for those people as well. There are a lot of opportunities to target non-engineers. I'm personally really motivated to create a world, or help build a world, where anyone can make software. Codex is not that product, but you could imagine those products existing over time. But, you know, as a general principle, it's really hard to predict exactly who the target user is until we've made some of these general-purpose technologies available, because it gets back to the empiricism I was talking about. We just never exactly know where the value's gonna lie.

Mark Chen: Yeah. And I think even to dig deeper into that, I mean, like, you know, you could have a person who's mostly using ChatGPT for coding, right, but 5% of the time, you know, they might just wanna talk to the model, or, like, 5% of the time, they just want a really cool image. Right? And so I think, you know, there are certainly archetypes of people who use the models, but in practice, we see that people want this exposure to different capabilities. Yeah.

Andrew Mayne: With Codex and watching the launch of that, it kind of struck me there are some tools you see that there's a lot of excitement about because there's a lot of internal demand for them. How much are you using it internally, tools like that? More and more? Okay.

Nick Turley: I've been really excited to see internal adoption. It's everything from, you know, exactly what you'd expect, you know, people using Codex to offload their tests, to, you know, we have an analyst workflow that will look at, you know, logging errors and automatically flag them and Slack people about it. So there are all these ways that we're using it, or, I've actually heard some people are using it as a to-do list, where, like, future tasks they're hoping to do, they're starting to fire off as Codex tasks. So this is the perfect type of thing that I think you can test internally. And, you know, I'm very excited about, you know, the leverage that engineers are gonna get out of a tool like this. I think it's gonna allow us to move faster with the people we have and make each engineer that we hire 10 times more productive. So in some ways, internal usage is a very good predictor of where we wanna take this.

Mark Chen: Yeah. I mean, we don't wanna ship something to other people that we don't find value in ourselves. And I think, you know, leading up to the launch—

Andrew Mayne: Laundry buddy.

Mark Chen: Laundry. Laundry—

Nick Turley: Buddy is an essential partner.

Andrew Mayne: Okay. Sorry. Sorry.

Mark Chen: I mean, yeah, we had some power users, though, that, you know, were personally generating hundreds of PRs a day. Right? So I think, you know, there are people internally finding a lot of utility in what we're building.

Nick Turley: Also, if you think about internal adoption, it's also a good reality check, because, you know, people are busy. Adopting new tools takes some activation energy. So actually, the thing you find when you try to drive things internally is some of the reality of how long it takes people to actually adjust to a new workflow, and it's been humbling to watch. Right?

Mark Chen: Mhmm.

Nick Turley: So I think you learn both about the technology, but you also learn about some of the adoption patterns when you're trying to get a bunch of busy people to change the way they write code.


Skills for the Future and the Power of Iterative Deployment

Andrew Mayne: As you build these tools, internally people have to learn how to use them and are having to adapt. And there's a lot of question now about what kind of skills people need in the future. What kind of skills do you look for on your teams?

Nick Turley: I've thought about this a lot. Hiring is hard, especially if you want to have a small team that is very, very good and humble and able to move fast, etcetera. Mhmm. And I think curiosity has been the number one thing that I've looked for, and it's actually my advice to students when they ask me, "What do I do in this world where everything's changing?" Because, I mean, for us, there's so much that we don't know. There's a certain amount of humility you have to have about building on this technology, because you don't know what's valuable, you don't know what's risky, until you really study and go deep and try to understand. And when it comes to working with AI, which we obviously do a lot, not just in code but in kind of every facet of our work, it's asking the right questions that is the bottleneck, not necessarily getting the answer. So I really fundamentally believe that we need to hire people who are deeply curious about the world and what we do. I care a little bit less about their experience in AI. Mark presumably feels a bit different about that one, but for the product side, it's been curiosity that I've found to be the best predictor of success.

Mark Chen: No. I mean, even on research, increasingly we index less on having to have a PhD in AI. Right? I think this is a field that people can pick up fairly quickly. I also came into the company as a resident without much formal AI training. And I think, correlated to what Nick said, one important thing is for our new hires to have agency. Right? OpenAI is a place where you're not gonna get so much of a, "Oh, here, today you're gonna do thing one, thing two, thing three." It's really about being kind of driven to find, "Hey, here's the problem. You know, no one else is fixing it. I'm just gonna go dive in and fix it." And also adaptability.

Andrew Mayne: Right? It's a—

Mark Chen: Very fast-changing environment. That's just the nature of the field right now, and you need to be able to quickly figure out what's important and pivot what you need to do.

Nick Turley: The agency thing is real. You know? I think we often get asked, you know, how does OpenAI keep shipping, and, you know, it feels like you're pushing something out every week or something like that. It's, one, funny, because it never feels that way to me. I always feel like, you know, we could be going even faster. But, you know, I think fundamentally, we just have a lot of people with agency who can ship. That comes to product, that comes to research, that comes to policy. Shipping can mean different things. We all do very different things at OpenAI, but I think the ratio of people who can actually do things, and the lack of red tape except where it matters (there are a couple of areas where I think red tape is very, very important), I think that is what makes OpenAI very unique, and it obviously affects the type of people we wanna hire, too.

Andrew Mayne: I was brought into the company because I was originally given access to GPT-3, and I just started showing all these use cases for it and making videos every week. Yeah, and that was annoying people, I'm sure—

Mark Chen: But it was not. It was really fascinating.

Andrew Mayne: It was exciting. It was an exciting time. I described it to people like, I think they built a UFO and I get to play with it. And then I make it hover, and they're like, "Oh, you made it hover." I'm like, "Well, they built it. I just pressed the button and got to do that." But what I found very empowering was the fact that I'm self-taught. I learned to code from Udemy courses and stuff, and then to be a member of the engineering staff and be told, "Just go do stuff." Nothing too critical. I didn't break anything for anybody. And it's good to know that that kind of spirit is still there. And I think that is part of the reason why OpenAI is able to ship, even though, you know, it was like 150, 200 people who worked on GPT-4. I think people forget about that. You know?

Nick Turley: Totally. And honestly, this is how even ChatGPT came together. You know, we had a research team. They'd been working for a while on instruction following and then the successor to that, and, you know, post-training these models to be good at chat. But the product effort came together as a hackathon. I remember distinctly, we said, like, "Who's excited to, you know, go build consumer products?" And we had all these different people. Like, we had a guy from the supercomputing team who, you know, was like, "I'll make an iOS app. I've done that in a past life." We had, you know, a researcher who wrote some back-end code, and it was just a convergence of people who were excited to do stuff and, I think, the ability to do so. And I think the strategy is running an organization where that is possible and continues to be possible as you scale.

Andrew Mayne: Hackathons were my favorite thing, one, because of being a performer and loving show and tell. But it was just neat to be able to see things that you knew were gonna be a product or something later on, because you're playing with technology this advanced and all of that. Do you guys still do them?

Mark Chen: Yeah, absolutely. We've had some fairly recently. Last week, actually.

Nick Turley: Can't say what it was about, but it happened. And it's how you find out what's possible.

Andrew Mayne: Yeah. I'm excited to hear that. I do have a question, which is: how much does that change as it grows? When I started, there were, I think, like, 150 people at the company. Now there's, like, 2,000. And now, you know, I see a video with Sam talking to Jony Ive. How much is that gonna change the character, the spirit? I think bringing in all this outside expertise has been great. We've seen this great run of products. But do you see it changing the culture?

Mark Chen: Well, I mean, I think probably in the right way. Right? When we look at AI, we don't think of it as some fairly narrow thing, and we've always been kind of enthralled by the potential and all the different things you could build with AI. And, yeah, to Nick's point, this is why we're able to ship so quickly: people imagine all these different possibilities. They imagine the future with AI, and they try to bring it about. Right? And I think these are facets of that imagination. Like, what does AI look like if you imagined an AI-first device, for instance?

Nick Turley: Yeah. When you go from 200 to 2,000, you'd think a lot would change. And maybe in some ways it has, but I think people often underestimate the number of things we're doing. I always feel like being at OpenAI feels much closer to being at a university, where you've got this common reason for being there, but everyone's doing something different, and you'll sit down at dinner or at lunch and talk to someone and learn about their thing. And you're like, "Wow, that's so cool that you're doing that." So it feels much smaller because of the broad range of things we're doing, and therefore each individual effort, whether that's something like ChatGPT or something like Sora, is actually staffed in a very conservative and lean way, which keeps people very autonomous and makes sure they have resources. I think that's partly what has made it feel very similar, in the good ways, to when I started here.


Preparing for an AI-Driven Future

Andrew Mayne: We talked a bit about how one of the things you look for is curiosity, and Mark said that's helpful, too. Say I'm somebody outside of AI, whether I'm 25 or 50, and I'm looking at the advancement of the technology and maybe having a little bit of fear, because I see copywriting is one of the things ChatGPT got great at. Writing code is another. I personally have the opinion that we'll never have enough people creating code, because there are more things code can do in the world than we can imagine. And even copywriting: my wife showed me the other day, on her sunblock lotion bottle, some very funny copy about the ingredients. I said, "Oh, this is not a place I expected to see this," but that's one of the tiny little places where all of a sudden you can put more thought into things. That being said, I know I'm a bit of an optimist, because I see all these opportunities. What advice do you give people, at whatever point they are in life, about preparing for, adapting to, or being part of the future? I like how Mark just looked right at me.

Mark Chen: Oh, no. You take this. Okay, I can go. I'll jump in right now. I think—

Nick Turley: The important—

Mark Chen: Thing is, you have to really lean into using the technology. Right? And you have to see how your own capabilities can be enhanced, how you can be more productive, more effective by using it. I fundamentally do think that the way this is gonna evolve is you will still have your human experts, but what AI helps the most is the people who don't have that capability at a very advanced level. Right? So if you imagine, as these models get much better at health care advice, they're gonna help the people who don't have access to care the most. Right? Image generation: it's not producing an alternative for experts or professional artists. It's allowing people like me and Nick to create creative expression. Right? And so I think it's a rising tide that allows people to be competent and effective at a lot of things all at once, and I think that's how we're gonna see a lot of these tools bootstrap people.

Nick Turley: The world's gonna change a lot, and I think truly everyone has a moment where the AI does something that they considered sacred and human.

Andrew Mayne: You know a guy who felt very threatened about his achievements in code and his abilities.

Nick Turley: Well, that happened for me a long time ago. We must be talking about—

Andrew Mayne: Someone else in the room.

Mark Chen: Oh, yeah. I mean, yeah. It's definitely better than me at a lot of code problem-solving, for sure. Yeah.

Nick Turley: Right. So I think it's deeply human to feel some level of awe, respect, and maybe even fear. And to Mark's point, actually using this thing can demystify it. I think we all grew up with, or learned about, the word AI in a world where AI meant something pretty different from what we have today. You've got these algorithms that try to sell you things, or you've got movies where the AI takes over, etcetera. That term means so many things to different people that I'm entirely unsurprised there's fear. So actually using the thing is, I think, the best way to have a grounded conversation about it. And then, from there, the best way to prepare: I think there's some degree to which you need to understand the products and keep up, sure, but things like prompt engineering, or understanding the intricacies of this AI, are kinda not the right direction. There are fundamental human things, like learning how to delegate. That is incredibly important, because increasingly you're gonna have an intelligence in your pocket that can be your tutor, your adviser, your software engineer. It's much more about understanding yourself and the problems you have, and how someone else might help, than about a specific understanding of AI. So I think that's gonna be important. Curiosity, I mentioned earlier. Asking the right questions: you only get out what you put in. Right? That's important. And fundamentally, being ready to learn new things. The more you understand how to pick up new topics and domains, the more you're gonna be prepared for a world where the nature of work is shifting much faster than it's ever shifted before. So I'm prepared that my job in product is gonna look different or not exist at all, but I am looking forward to picking up something new. And I think as long as you bring that perspective, you're well set up to leverage AI.

Andrew Mayne: I think we sometimes over-index on jobs going away. Sometimes certain jobs do go away; we don't really need a lot of typewriter repair people anymore. Right? And certain kinds of coding jobs are probably gonna go away. But like I said, I think there's way more opportunity for coders, or for people to create code however it's done. And you mentioned the health field. That's one where I hear people say, "Oh, when we replace everything with AI," and, well, I would be very happy having an AI diagnose me, operate on me, and probably do everything else. But I do want somebody there to talk me through the procedure and hold my hand. And I also want people to be able to ask questions. You know, every day I take a bunch of vitamins. Is this the right time of day to take them? I can't bother my doctor with all these silly little questions.

Nick Turley: I really don't think you end up displacing doctors. You end up displacing not going to the doctor. You end up democratizing the ability to get a second opinion. Very few people have that resource or know to take advantage of a resource like that. You end up bringing medical care into pockets of the world where it's not readily available, and you end up helping doctors gain confidence. I've often heard from doctors that they already talk to colleagues to get a second opinion. In some cases that's not possible, and I think you'd be surprised by the number of doctors that use ChatGPT. Now, on things like medicine, there's work to make the model really, really good, and we're excited to do that. There's also work to prove that the model is really good, because I think you're not gonna trust it until there's some degree of legitimacy. And then there's work to explain the areas where the model might not be good, because once it gets to human and then superhuman levels of performance, it's hard to frame exactly where it will fall short, which is also hard to reckon with. But nonetheless, I think that opportunity is one of the things that gets me up in the morning. Education might be the other one, and I think there's a tremendous opportunity to help people.


The Future of AI: Reasoning, Agentic Models, and New Form Factors

Andrew Mayne: What do you think is gonna surprise us the most in the next year to eighteen months?

Mark Chen: I honestly think it's gonna be the amount of research results that are powered, even in some small way, by the models that we've built. One of the quiet things that's taken the field by storm is the ability of the models to reason. And you already see it in some research papers.

Andrew Mayne: I'm gonna make you explain when you say reason.

Mark Chen: Yeah. So this fits into the—

Andrew Mayne: I want you to reason through the question as you explain reasoning. Out loud. Yeah, think out loud. Exactly. Show your traces.

Mark Chen: Yeah. This really fits into the agentic paradigm that we were talking about earlier. The way the models approach solving a problem that takes some time is that they reason through it, much like you or I might. Right? If I give you a very complicated—

Andrew Mayne: Puzzle. You probably reason through it much better than I do, Mark.

Mark Chen: I'm flattered. But yeah, take a complicated puzzle. Let's just use a crossword puzzle. Right? You might think through all the different alternatives and what's consistent. You know, is this row consistent with that column? You're searching through a lot of alternatives, you're backtracking a lot, you're testing hypotheses. And then at the end, you come up with a well-formed answer. The models are getting a lot better at that, and that's what's powering a lot of the advancements in math, in science, in coding. It's reached a level where, today, in many research papers, people are using o3 almost as a subroutine. Right? There are subproblems within the research problems they're trying to solve which are just fully automated, solved by plugging into a model like o3. I've seen this in several physics papers. I've talked to physicists who are like, "Wow, I had this expression I couldn't simplify, but o3 made headway on it." And these are coming from some of the best physicists in the country. So I think you're gonna see that happen more and more, and we're gonna see acceleration in progress in fields like physics and mathematics.
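
For illustration only: the search-and-backtrack behavior Mark describes is loosely analogous to a classic backtracking solver. The toy puzzle, word list, and function names in this sketch are made up, and it says nothing about how the models themselves are implemented.

```python
# A toy backtracking search, loosely analogous to the "try alternatives,
# check consistency, backtrack" loop Mark describes. Purely illustrative;
# the puzzle (three crossing word slots) and its constraints are invented.

WORDS = ["cat", "car", "arc", "tar", "rat", "art"]

def consistent(assignment):
    # Hypothetical crossing constraints: slot 0 and slot 1 share their
    # middle letter; slot 1 and slot 2 share their last letter.
    if len(assignment) >= 2 and assignment[0][1] != assignment[1][1]:
        return False
    if len(assignment) >= 3 and assignment[1][2] != assignment[2][2]:
        return False
    return True

def solve(assignment=()):
    if not consistent(assignment):
        return None              # dead end: backtrack
    if len(assignment) == 3:
        return assignment        # a well-formed answer
    for word in WORDS:           # try the next hypothesis
        if word in assignment:
            continue
        result = solve(assignment + (word,))
        if result is not None:
            return result
    return None

if __name__ == "__main__":
    print(solve())  # e.g. ('cat', 'car', 'tar')
```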

Nick Turley: It's a hard one to beat, because I would swap many things we do in exchange for making a true, significant scientific advancement. But I think we can have multiple of these things. For me, it's the fact that any well-described problem that is intelligence-constrained—

Andrew Mayne: Mhmm.

Nick Turley: Will be solved in products, and I think we're fundamentally just limited by our ability to do that. What that means is, for companies in the enterprise, there are so many problems that are fundamentally hard, that the models are not smart enough to do yet, whether that's software engineering, running data analysis, or providing amazing customer support. There are all these problems that the models fall short at today that are very, very easy to describe and evaluate, and I think we'll make tremendous progress on those. On the consumer side, these problems exist too. They're a bit harder to find, just because consumers are worse at telling us exactly what they want. That's the nature of building consumer products, but I think it's very, very worthwhile, because there are many hard things we do in our personal lives, whether it's doing taxes, planning a trip, or searching for a high-consideration purchase, whether that's a house or a car or a piece of clothing. All of those are problems where we need just a little bit more intelligence and the right form factor. So I think the other thing that's gonna happen in the next year and a half is you'll see a different form factor for AI evolve. I think chat is still an incredibly useful interaction model, and I don't think it's gonna go away, but increasingly you're gonna see more of these asynchronous workflows. Coding is just one example, but for consumers, it might be sending this thing off to go find you the perfect pair of shoes, or to go plan a trip, or to go finish your taxes. I think that's gonna be exciting, and we're gonna think of AI a little bit differently than just a chatbot.

Andrew Mayne: One of my favorite examples, both from a capability point of view and a UI point of view, was Deep Research. Deep Research is probably the best example we have of agentic model use right now, because it used to be that you'd ask a model to tell you about a topic, and it would either already have the data or do a big search of the Internet and just summarize it all, whereas Deep Research will go find some set of data, look at it, ask a question, then go find new data, come back to it, and keep going. The first time I used it, and when other people used it, it was like, "Wow, this is taking a while." Then you added a UI change so I can go away and do something else, and the lock screen on my phone shows me it's working, which was a paradigm shift. I talked to Sam here about that, and Sam said that was a surprise to him, the fact that people would be willing to wait for answers. And now I've seen a new metric for models: how long a model can spend trying to solve a problem, which is a good metric if it ultimately solves it. Has this been an update to you in how you think about these things? You talked before about agentic use, and the idea that it's not just "give me the answer." It's, "Take your time. Get back to me."
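
For illustration only: the loop Andrew describes, find some data, read it, ask a new question, and repeat, can be sketched roughly as below. The function names and stopping rule are hypothetical stand-ins, not OpenAI's actual Deep Research implementation.

```python
# A minimal sketch of an iterative "search, read, refine, repeat" loop in
# the spirit of what Andrew describes. All helpers are fake stand-ins so
# the example is self-contained and runnable.

def search_web(query):
    # Stand-in for a real search call; returns fake "documents".
    return [f"document about {query}"]

def read(documents, question):
    # Stand-in for a model reading documents and extracting findings.
    return [f"finding from {doc!r} relevant to {question!r}" for doc in documents]

def next_question(findings):
    # Stand-in for the model deciding what it still needs to know.
    # Here we simply stop after collecting three findings.
    return None if len(findings) >= 3 else f"follow-up #{len(findings) + 1}"

def deep_research_loop(topic, max_steps=5):
    findings = []
    question = topic
    for _ in range(max_steps):
        docs = search_web(question)          # go find some data
        findings += read(docs, question)     # look at it
        question = next_question(findings)   # ask a new question...
        if question is None:                 # ...or decide we're done
            break
    return findings

if __name__ == "__main__":
    for finding in deep_research_loop("vitamins and timing"):
        print(finding)
```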

Nick Turley: I think to build a super assistant, you gotta relax constraints. Today, you have a product that is entirely synchronous. You have to initiate everything. That's just not the best way to help people. If you think about a real-world intelligence that you might get to work with, it has to be able to go off and do things over a long period of time. It has to be able to be proactive. So we're sort of in this process of relaxing a lot of the constraints on the product and on the technology to better mimic a very, very helpful entity. The ability to go do five-minute tasks, five-hour tasks, eventually five-day tasks is a very fundamental thing that I think is gonna unlock a different degree of value in the product. So I've actually not been that surprised that people are willing to wait. I don't really wanna be sitting around waiting for my coworker either, and if the value is there, I'd gladly be doing other stuff and come back.

Mark Chen: Yeah, and we really don't do it just because, right? We do it out of necessity. The model needs that time to solve the really hard coding problem or the really hard math problem, and it's not gonna do it with less time. You can think about it like this: I give you some kind of brain teaser. Your quick answer is probably the intuitive wrong one, and you need that actual time to work through all the cases, like, are there any gotchas here? And I think it's that kind of stuff that ultimately makes robust agents.

Andrew Mayne: We've seen this pattern where there's the paper of the moment, where somebody comes out and says, "Ah, I found a blocker." I remember one a month or so ago that said models couldn't solve certain kinds of problems, and it wasn't hard to figure out a prompt you could train into a model so it could solve those kinds of problems. And there was a new one that talked about how models would fail at certain kinds of problem-solving, and that was, I think, pretty quickly debunked by showing that the paper had flaws in it. But there are limitations. There might be blockers, or things we don't know are going to be there. I think brittleness is one of them. There is a point where models can only spend so much time solving a problem. We're probably at a point where we have maybe two systems watching each other, and we have to think about how a third system stops things from breaking down. But do you see any blockers between here and getting to models that are going to be doing things like coming up with interesting scientific discoveries?

Mark Chen: I think there are always technical innovations that we're trying to come up with. Fundamentally, we're in the business of taking simple research ideas and executing them at scale, and the mechanics of actually getting that to scale are difficult. Right? It's a lot of engineering and a lot of research to figure out how to tweak past a certain roadblock, and I think those are always gonna exist. Every layer of scale gives you new challenges and new opportunities. So, fundamentally, the approach is the same, but we're always encountering new small challenges that we have to overcome.

Nick Turley: Just to build on that, the other business we're in is building great products with these models, and I think we shouldn't underestimate the challenge and the amount of discovery needed to really bring these ever more intelligent models into the right environment, whether that's giving them the right sort of action space and tools, or really being proximate to the hardest problems, understanding them, and bringing the AI there. So there's the technical answer, but there's also real-world deployment, and that always has challenges that are very hard to predict yet worthwhile, and part of our mission to do it all.


Favorite Uses for ChatGPT

Andrew Mayne: All right, last question, and I'll begin: what's your favorite use or tip for ChatGPT? Mine is I take a photograph of a menu and ask, "Help me plan a meal," if I'm trying to stick to a diet or whatever.

Nick Turley: See, I really want that use case, but I've been trying it for wine lists, and that is my eval for multimodality. Still doesn't work. It keeps embarrassing me with, like, hallucinated wine recommendations, and I go over them and they're like, "Never heard of this one." So I'm glad yours works. For me, that one's still not there.

Andrew Mayne: Well, maybe the wine list is too dense. That was a problem with Operator originally, with the vision models: too much dense text and it just loses its placement.

Mark Chen: Yeah. I mean, speaking of Deep Research, I love using Deep Research. When I go meet someone new, when I'm gonna talk to someone about AI, I just preflight topics. Right? I think the model can do a really good job of contextualizing who I am, who I'm about to meet, and what things we might find interesting, and I think it really just helps with that whole process.

Andrew Mayne: Very cool.

Nick Turley: I'm a voice believer. I don't think it's entirely mainstream yet, because it's got many little kinks that all add up, but for me, half of the value of voice is actually just having someone to talk to and forcing yourself to articulate yourself, and I find that sometimes very difficult to do in writing. So on my way to work, I'll use it to process my own thoughts. And with some luck, and I think this works most days, I'll have a restructured list of to-dos by the time I actually get there. So voice, for me, is the thing that I both love using and wanna see improve over the next year.