Building Better Systems

#9: Tycho Andersen – Commit Log Spelunking

Episode Summary

Tycho Andersen shares lessons that Linux kernel developers have learned from decades of open-source interactions. We discuss how the open-source community works together to make the Linux kernel better for everyone, and also what it's like to work debugging the kernel.

Episode Notes

Watch all our episodes on the Building Better Systems YouTube channel.

Joey Dodds: https://galois.com/team/joey-dodds/

Shpat Morina: https://galois.com/team/shpat-morina/

Tycho Andersen: https://tycho.pizza/

Galois, Inc.: https://galois.com/

Episode Transcription

Intro (00:02):

Designing, manufacturing, installing and maintaining the high-speed electronic computers, the largest and most complex computers ever built.

Shpat (00:22):

Everyone, welcome to another episode of Building Better Systems. Um, this is a podcast where we get together with people at the forefront of firefighting and working on making systems better and building the tools and approaches that kind of help them get there. Um, my name is Shpat Morina.

Joey (00:41):

And I'm Joey Dodds. Today we're joined by Tycho Andersen. Tycho's an engineer at Cisco working on Linux kernel security and, uh, kernel stability. Tycho spends his days working to solve really complicated problems that are going on in the kernel that are causing both Cisco and other companies to feel pain and causing their container infrastructure to suffer. And Tycho basically works tirelessly to make as many of those problems as possible go away while also improving the Linux ecosystem for everyone. Uh, he holds degrees from the University of Wisconsin-Madison and Iowa State University. And it sounds like he's been an open source contributor for quite a long time. So this is a really, I think, ideal job for Tycho. Uh, thanks for joining us, Tycho. Yeah, thanks. Thanks for having me. I'm excited to be here. Now that we've informed ourselves thoroughly about Tycho, I'm going to kick us off with our usual question.

Joey (01:45):

Um, Tycho, what's your approach to building better systems?

Tycho (01:48):

Oh, well, uh, unfortunately I don't know that I have a very good one. I think, um, I feel like some days my job is just professionally catching knives, um, and I have never had any formal training in knife catching. So, you know, most of my advice today is probably informed by, uh, just a long time of, you know, building stuff and, um, seeing what goes wrong. Uh, sort of at the core, I would say writing more tests is good. And, uh, you know, that's basically the number one thing, I guess.

Joey (02:24):

Um, yeah, so we brought you on today, Tycho, 'cause, um, this is a really interesting counterpoint to some of the perspectives that we've shared so far, which have all been about sort of the most principled possible approaches to making systems better. Um, but in reality, most people maybe don't have the time or energy to always take the most principled approach possible. And you have to settle for something that gets things working today, that gets things working tomorrow, that stops your machines from crashing. Could you tell us a bit about what this knife catching you do day to day entails?

Tycho (03:00):

Sure. Yeah. Um, so my day to day, I work on this Linux platforms team at Cisco, and I've worked kind of in the container realm, um, both sort of on the kernel and the low-level user space that calls into containers, and container checkpoint/restore and container stuff, um, since about 2013. And, uh, so what I do day to day for Cisco is I'm a stable kernel maintainer for their platforms. And then also I do some feature development in the kernel, um, for containers. And then also just some thinking about security, although that's very hard. So, um, yeah, that's what I do.

Joey (03:36):

So what are the implications of kernel stability on basically, like, our day-to-day lives? Like if the kernel is crashing, for example, how bad is it?

Tycho (03:47):

I guess it depends on which kernel is crashing. So, um, you know, I work at Cisco and we deploy a lot of Linux all over. So mostly we're at the, you know, network architecture level, and all of our systems are, you know, uh, the name is escaping me right now, highly available. There we go. Um, so they all have failover and stuff. So for the most part, if one kernel goes wrong, it's okay. It's when there's, like, an issue that affects a lot of them. You know, a lot of people have Linux running in their pockets on Android. If that kernel starts crashing, you're totally screwed. So, um, I think people care a lot about kernel stability, uh, and have started to care a lot about kernel stability, I would say, especially in the last 10 years, um, as Linux has gotten more and more prevalent, and there's been a lot of work. You know, when Linux was first a thing, you know, in the nineties, um, this was kind of before my time really. I really became a full-time Linux user in 2007.

Tycho (04:44):

Um, but I think there just wasn't a lot of testing, and people kind of put patches in and, you know, uh, if it broke, you got an email and that's how you knew it broke. Um, so, you know, the rest of the world was your CI system, uh, which is fine if you're kind of a hobbyist project in the nineties, but now it's a, you know, very professional project powering, you know, gazillions of dollars of revenue. And so, uh, I think the kernel community as a whole has really woken up in the last 10 years and tried to do a better job of, um, you know, testing and things.

Shpat (05:19):

It's super interesting. You know, we hear this over and over again: you're building something and it's a hobby, or you're kind of involved as a hobbyist. And then later on that code goes on to be almost critical, well, very critical to a lot of things. Um, and then somebody has to go back in and work with these systems that were built without necessarily thinking, like, oh, this is going to have, as you said, millions of dollars or maybe even lives, you know, depending on this thing running well. Um, I'm curious if you have any kind of thoughts for people who are writing code kind of as a hobby that might later on, um, you know, become that kind of impactful. You can't know that, by definition, but, um, let me step back and ask this question. What can people do when they're writing code that right now might not, you know, have other people rely on it massively, but, um, let me restart this question. Um, if you're working on code as a hobby and not as a profession, um, what are some low-hanging fruits from your perspective that wouldn't necessarily make it kind of a professional commitment of time, that you might be doing to make those systems better?

Tycho (06:43):

Yeah, it's, uh, it's a great question. Um, and I actually have, I dunno, some experience with this. Um, so one of my open source projects that I really have loved working on for a long time is this window manager called Qtile. Uh, and it's just a window manager written in Python. And, uh, it turns out it will never be valuable to anyone. Nobody will ever pay anyone to work on a window manager because we already have too many. Nobody cares about window managers except for the weirdos on the internet who care, uh, of which I am one. Um, and if you look through the commit history of that window manager, 'cause I've been working on that project for almost 11 years now, you can see early on my commit messages were short.

Tycho (07:26):

They're kind of, I don't know, not very good. And then basically right about the time that I started becoming a Linux kernel contributor, everything really snaps into, like, very professional, um, commit messages, with links to various documentation, you know, oh, here's this weird place in the X11 spec where this thing happens. You always hear about how the X11 API is terrible. One of my favorite goofy things is, in XCB, there's a thing where it uses the space in memory before the pointer you pass in. So you have to allocate some memory before the pointer you pass in. And I just think that's crazy, but, you know, the commit message for that particular thing, when I fixed that bug, has a big long write-up of why this is and all this stuff.

Tycho (08:14):

And so I think, you know, it seems sort of unnecessary at the very beginning, but when that code becomes important, you or someone like you, but probably you, um, you know, if you care about the project and you enjoy it, will have to figure out, why did I do this? And if you go back to the commit logs and you see, you know, "initial commit" or "fix XYZ" or whatever, and there's not a lot of reasoning about why a particular line of code exists, it can be hard to, um, you know, figure out what's going on. So I guess the thing that I've done since I started working on the kernel is, in all other projects, I just write good commit messages, even for repositories of, like, my dotfiles and stuff. You do a lot of spelunking when you enable some one-line change in your bashrc or your zshrc or whatever.

Tycho (09:06):

And, like, that's valuable time that you will have wasted if you don't record that information somewhere. And so the commit log is a great place to do that. So I think the most basic thing people can do on projects they're working on today, that aren't necessarily a big deal, is just practice good commit hygiene. Make sure everything would always bisect, even if maybe nobody will ever bisect it, but it's just a good habit. And, you know, the kernel, if you ever want to be a Linux kernel contributor, they force you to do that. I mean, your patch just will not get accepted if it doesn't follow these very strict rules, which is totally reasonable. Um, but I think that's just a super basic thing that is very, very valuable.
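To make the "everything should always bisect" habit concrete, here is a minimal sketch of the binary search that a bisect performs. It is an illustration only: the commit names, the `first_bad_commit` helper and the `is_bad` test are all hypothetical, and it assumes the oldest commit is known good, the newest is known bad, and every commit in between can actually be built and tested.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad. The search
    only works if every commit in between builds and can be tested, which is
    why a single broken intermediate commit hurts so much.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the bug was introduced at mid or earlier
        else:
            good = mid   # the bug was introduced after mid
    return commits[bad]


# Hypothetical history in which the bug first appears in commit "e".
history = ["a", "b", "c", "d", "e", "f", "g", "h"]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)          # record which commits we had to test
    return commit >= "e"


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", len(tested), "commits")
```

With eight commits the search needs only three tests, and the number of tests grows with the logarithm of the history length, but only as long as each tested commit is self-contained.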

Joey (09:45):

This has really been a recurring theme with people that we've talked to, that we can't just look at technology. Uh, we published episode two, which was with Muse Dev. And one of the things they said is, we can do all we want on the technology side, and if people don't adopt it, and if people don't actually think about how they're writing code and how they're putting code together, the technology can't help. Um, and it sounds like the Linux community, and I don't know when this started, is doing a really nice job of holding people to that standard of explaining themselves. But it also sounds like you've taken that professional learning that you did and found it valuable to apply even to your personal projects, which sounds like a great example of that actually being valuable. Right? Sometimes we do things at work and we're like, we have to tick the box for work. Fine. I tick that box, they wanted me to do it. Um, but sometimes things you do professionally are really there for a reason. And it sounds like this is one of them.

Tycho (10:42):

Yeah, absolutely. Uh, I think it's a super valuable thing and, uh, you know, I've become a lot more strict about it with other contributors, um, in my open source projects, just because I think it's important now, after having watched the Linux people do it. Now you're that guy. Now I am absolutely that guy, and I am totally unapologetic about it. Uh, this is a super important thing. So it's brilliant. It's excellent.

Joey (11:09):

And it sounds like the bar, the bar for doing that, it sounds like, is like, am I going to maybe work on this next year? And if, if the answer to that is yes, then go ahead and just take that extra minute to write a good commit message.

Tycho (11:22):

Yeah. I mean, I sort of think about it like, well, you try and eat healthy now. You try and exercise now. You try and save money for your future retirement or whatever. Like, you do all of these other things to take care of your future self. Here's one more way in which you can take care of your future self, because odds are, it's going to be you. So, um, yeah, totally. I view it exactly the same.

Joey (11:44):

But it sounds like Linux is maybe in the situation where it, uh, you know, went off to college in the early nineties and gained its freshman 15 and, uh, you know, maybe tried a few experimental things, and things got a little wild, and maybe part of your job is dealing with that at this point.

Tycho (12:03):

Right. Yeah. I mean, so I think, if you look at Linux commit messages, they have always been very good, even before things like Git existed. I think they were very good about that. Now, yeah, the thing that maybe wasn't so good and has gotten better, to your point, is the testing. Um, you know, now there's a Linux Foundation staffer, Shuah Khan, and her job is to maintain kselftest, to make sure it gets run and that kind of stuff. I think she was hired in 2017. So, like, it's a pretty new thing to have a Linux Foundation fellow who just focuses on testing. Like, you know, we've had, um, other testing and infrastructure projects. xfstests has been around for a long time, 'cause the file system people, you know, need to make sure their stuff isn't broken, but that's not part of the kernel repository.

Tycho (12:54):

There's no integrated CI, or at least there wasn't at the time, um, when they started that project. There's, um, like, the Intel build bot, which basically just tries to build everything that gets sent to LKML. And I think a lot of build failures started getting caught that way because it doesn't just build x86. In spite of being run by Intel, it builds all the architectures. And so, you know, your thing may compile on stuff that most people use, but, like, Xtensa or whatever, you know, might break that compile, and you'll get an email about that these days. Or, um, you know, I think maybe the most important project is the syzkaller project, um, by some guys at Google. I think it's led by Dmitry Vyukov. And that's been a super valuable project, you know, over the course of the last five or eight years, however long it's been going. So, uh, the kernel community is growing up. Um, but there was, I think, some work to lose the freshman 15.

Joey (13:49):

Indeed, but I mean, it sounds like even good commit messages in the nineties, I think, was probably kind of a big deal, right? People weren't doing much CI in the nineties, so you can certainly be forgiven for not having a CI built in the nineties. And, like, maybe everybody that had to approve it would download the thing themselves and try to build it and sign off. And maybe sometimes they sort of skipped it 'cause it looked good to them and they didn't have time. Um, but it sounds like they've been pretty close to as quick to adopt these good practices as anybody has, basically.

Tycho (14:21):

Yeah. I mean, certainly around commit messages, I've never seen a project that was anywhere near as good as the Linux kernel. I mean, to the point where, like, if you watch some of these interviews that Linus does at various conferences, I mean, he talks about it. He's like, I almost don't care about the code change. He's like, the commit message is, you know, the part people are gonna inspect. And, you know, I mean, the other thing about the kernel is, like, there are probably hundreds, maybe thousands of engineers who are gonna look at all this stuff because, you know, there's a big team of guys. When I was at Canonical, they had 25 kernel engineers who were backporting stuff across their kernels. Um, I think Red Hat has an even bigger team. You know, SUSE has a team. I work for a team at Cisco where I regularly spend time spelunking through the kernel logs, trying to find, you know, information about backtraces and all kinds of stuff. So, I mean, there's, you know, hundreds or thousands of engineers all over the world that are going to try and figure out, like, oh, I have this bug. You know, generally when I'm looking at a kernel bug, it's in some subsystem that I don't really know that well. Um, so I'm sort of grasping at straws already. So, like, a healthy explanation in the commit log is very, very helpful.

Shpat (15:31):

Hey, you touched on this a little bit. Um, but you know, when we were doing prep for this podcast, you were telling us a little bit about your job and it sounded really cool. I mean, you said knife catching, but the way I saw it was basically you were called in to do some, you know, firefighting, uh, firefighter-meets-Sherlock-Holmes kind of stuff, um, as it relates to kind of kernel stuff. I wonder if you could, like, tell us a little bit about what that entails day to day and what that looks like.

Tycho (16:04):

Sure. Yeah. Um, so we have, um, I think probably like any large organization, some levels of support and, um, you know, when the customer hits a kernel bug, you know, it goes to some frontline support engineer and, you know, it goes up various levels until finally, um, most of the time, by the time it gets to me, it's been touched by a bunch of people. So the system is not in its original state. Uh, often we have a backtrace. Um, in fact, I enabled the use of a subsystem called pstore in the kernel, where there's a small writable section in the ACPI tables. And now when, uh, one of our devices core dumps, or sorry, when the kernel panics, we write the kernel panic stack trace into the ACPI tables, because often systemd will rotate away all the logs before I even get it.

Tycho (17:02):

And so the only place that we get this stack trace is from this, like, little writable section in the ACPI table. So often what I start with is, uh, you know, here's a kernel stack trace, you know, please fix this bug. And sometimes it will come accompanied with an explanation of, oh, we were doing this at the time, but usually it's, like, a user who's like, oh, I clicked on this JavaScript icon in the app. And that's, you know, not really connected to what happened, you know, on the backend when the kernel crashed. So, you know, um, the first thing to do is, you know, Google bits of the stack trace, and there are, like, you know, heuristics about how you can splice bits, just to see if anybody else has seen it. Um, you know, are there any other bug reports from this?

Tycho (17:46):

A lot of times there's, you know, the, uh, kernel bug tracker, but that's not so good. LKML has some reports. Um, so there's various places. Um, one of my favorite bugs, probably my all-time favorite bug that I've fixed while working at Cisco, um, was a bug in the TTY subsystem, which is, like, the subsystem that, you know, does all kind of the low-level text input and output, um, when you're interacting with it, like, over a serial console or something. And about, I think it was maybe 15 or even 20 years ago at this point, uh, there's a guy named Alan Cox who worked for Red Hat and, uh, he was the maintainer of the TTY subsystem. And he, uh, you know, got into a fight with Linus about something. Linus didn't like the way he had handled something. And Alan Cox said, well, you know, fix it yourself.

Tycho (18:36):

I'm not dealing with this anymore. And the TTY subsystem has not seen a lot of love. I think Greg Kroah-Hartman is the maintainer now. And he's just so swamped, he doesn't have enough time to really sit down and, um, I mean, he'll fix bugs if people report them, but there's no active development going on. So anyway, um, I get this bug, uh, I get the kernel stack trace and I'm looking at it and it's, you know, clearly a race condition, and I'm trying to think, like, well, how did this happen? And it turns out basically that there was a critical section that was protected by two different locks. So one function used lock X, a different function used lock Y. They both look reasonable until you actually look at, well, this is lock X and this is lock Y.

Tycho (19:17):

So it turns out the race condition is pretty easy if you're using two different locks. So anyway, the result of this was, it turns out that Alan Cox had been right in the middle of a refactoring when he got annoyed with Linus and quit. And probably this is, like, a thing that would have gotten fixed if he hadn't gotten annoyed and had stayed working on it, because he was in the middle of shifting to this new locking scheme. So anyway, the TTY subsystem has these two different locking schemes basically because of this, uh, sort of people issue. So anyway, all this is to say, um, you know, I figured this all out from the commit logs and looking at mailing list, you know, postings and timings, but this is, like, sort of the spelunking that you do. Like you guys were saying, Sherlock Holmes. Sherlock Holmes was really smart. I don't know that I'm a Sherlock Holmes, but you spend a lot of time, more so than writing code, even, just reading commit logs and doing this spelunking with stack traces and that kind of stuff, um, more than anything. So I guess that's my answer. I don't know.
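The "one critical section, two different locks" bug can be sketched in a few lines. This is a user-space Python illustration, not the TTY code: the helper name `sections_can_overlap` and the locks `lock_x` and `lock_y` are hypothetical. Each worker takes "its" lock and then waits at a two-party barrier; the barrier only completes if both workers are inside the section at the same time, which is exactly what a lock is supposed to prevent.

```python
import threading


def sections_can_overlap(lock_a, lock_b, timeout=0.5):
    """Return True if one thread holding lock_a and another holding lock_b
    can be inside the 'critical section' at the same moment."""
    rendezvous = threading.Barrier(2)
    overlapped = []

    def worker(lock):
        with lock:
            try:
                # Only succeeds if the other worker is also inside right now.
                rendezvous.wait(timeout=timeout)
                overlapped.append(True)
            except threading.BrokenBarrierError:
                # Nobody joined us in time: mutual exclusion held.
                pass

    threads = [threading.Thread(target=worker, args=(lock,))
               for lock in (lock_a, lock_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bool(overlapped)


lock_x = threading.Lock()
lock_y = threading.Lock()
print(sections_can_overlap(lock_x, lock_y))  # two different locks
print(sections_can_overlap(lock_x, lock_x))  # the same lock on both paths
```

Each function looks correct in isolation because it does take a lock. The problem only shows up when the two call sites are compared, which is why the history of the locking change mattered for the diagnosis.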

Shpat (20:19):

Fantastic. It's super interesting to see, not only to fix something, but also to kind of dive into, like, how did it come to be, from this essentially open, uh, historical kind of catalog of everything that had happened, both as commit messages, but also just, you know, socially what happened, right? We talked about commit messages. Um, the Linux community is able to keep things relatively bug free, and that with it more and more kind of being a critical part of a lot of things.

Joey (20:54):

Well, it's good enough that my phone doesn't usually turn off in the middle of the day. Right. Which is really good considering how much work is going on.

Shpat (21:01):

Exactly. And it's still kind of a community process, uh, an open community process, at least the way I understand it, naively. Um, I'm wondering if, what does that process look like today? And, you know, if people, if there's other people working on things that are relatively critical, what can they steal from that, from that whole thing, especially if they want to integrate a community in the process.

Tycho (21:26):

Yeah. I mean, I think the vision of, uh, the Linux kernel as a community process is totally correct. Um, now it's not a community of hobbyists sitting in their basements. I get paid full time to work on Linux, um, you know, in, uh, various forms, whether it's userspace or the kernel, um, but you know, I get paid. And for example, this patch series that I'm working on now, uh, we have people from Google. Uh, we have people from Red Hat. Uh, we have people from Canonical, all making the argument that, hey, this should go into the Linux kernel. Um, so the thing that I'm working on now is, um, basically a container feature so you can UID shift, um, file systems. This is what it is. Um, so it's, uh, honestly, it's a core piece of the container infrastructure that hasn't been implemented before.

Tycho (22:20):

And it's kind of crazy that it hasn't, or it's been implemented, but nothing has been merged. Um, but it's a fairly big patch series. Um, and I'm only, you know, a small, two-patches part of it. Um, but it's such a big series that, uh, you need collaboration from a lot of people, because if one person sends this big, huge patch series and says, hey, I want to do this crazy thing, everyone else is going to go, nah, just figure out a different way. Um, whereas now we have collaboration from all these different companies. And we all have said, yes, we want this, and here's the way we're going to use it. And all of our use cases are different. Um, and there's been versions of this, you know, that have surfaced before that wouldn't have satisfied Cisco's use cases. And so we've had to go to conferences or, you know, uh, since coronavirus, do the online thing and sort of, uh, you know, sway people. You know, we get in these conferences and we sit in the rooms and we say, well, here's our use case.

Tycho (23:16):

Your solution doesn't work, but if you modify it like this, then, you know, we're willing to lend engineering time, and the four months of my time, to help you. Um, and so, you know, it really is sort of design by committee, for all of the good things and all of the faults as well. Um, and, you know, the thing I mentioned before is, you know, the ABI is stable and so it can never be broken. So once an API goes in, it will always be there. So it really pays to think about the API before you put it in.

Shpat (23:48):

Maybe the fact that it's so permanent kind of makes that level of thinking about it and being deliberate, almost a must.

Tycho (23:58):

Yeah. Which is maybe one of the reasons this has not existed for the last however long, even though it seems like an obvious feature. It's just that incarnations before this current one, that we started working on maybe in February, um, just wouldn't have solved all the problems that we needed to solve. So the community is pretty conservative about adding new APIs, which can be annoying because, you know, we've been feeling this pain for the last five years and, you know, Ubuntu has one set of out-of-tree patches in their kernel and other people solve this other ways. But, you know, now that we've all been kind of doing this the painful way for a while, we have a little bit better insight about what we actually need in order to accomplish our goals. So, um, you know, the fact that we can design one API that does lots of things for lots of people probably means that it's a fairly reasonable primitive, uh, instead of just one very narrow use case that one person,

Joey (24:51):

I think it's really amazing, you know, the myth you dispelled earlier, that it's all hobbyists hacking in their basements. Um, I think a lot of people maybe still hold that about open source, but it's almost more amazing when you think of the fact that, you know, you mentioned in some sense downsides for the company, right? Like, I'm sure a company like, you know, Cisco would maybe rather not have to argue its case to other companies, but it's also a clear indication of the value that Linux is bringing to all of these companies that you stick with it. So for every bit of design by committee, obviously you feel that you're getting that much effort from every other company that's playing a part in this game, and that everybody's kind of winning together, which is a really inspiring story.

Tycho (25:36):

Yeah. I mean, uh, you know, uh, it's much easier for me to spend a few, uh, conference evenings or whatever, arguing with people about how an API design should look than implementing, you know, the whole thing from scratch myself or with a small team or whatever. So it, you know, there absolutely is, is a lot of value created in just having this thing, that for the most part, you can grab off the shelf. And I mean, we have to build all our own kernels and we, we have a lot of infrastructure to build all our own packages and all this stuff. Um, so, you know, it's not, it's not free, but, uh, it is, I think basically as long as Linux is around, you know, you're just going to have to pay guys like me to just babysit it. Um, but paying guys like me a bit to babysit Linux is, is much cheaper than paying guys like me to write your own operating system from scratch. So

Joey (26:28):

Yeah, well in the work you do, you know, Cisco could arguably fund its own kernel development if it needed to, um, and it can fund you to take care of things, but there's tons of companies that start up depending on Linux that could never, you know, that for years couldn't afford a single kernel developer. Right. And they all get to benefit as well. It's yeah, absolutely.

Shpat (26:50):

Sure. Wonderful. Yeah. It's weird, my cynical self is like, what's the catch? But in a way, you know, it's all, whatever, the rising tide, something, something, whatever the idiom is, something about ships, but, uh, yeah, it seems like everybody benefits from this in a way. And the downside is that you get to argue a lot and, you know, get to work with people really closely to convince them that it's one way or the other.

Tycho (27:18):

Yeah. And, you know, I mean, I think there's a fair amount of people, like, when I first got a job working on Linux, my dad was like, so, like, what's the value prop? Like, why are they paying you? You know? Um, and I think, you know, it has taken maybe some companies longer to understand, um, the value prop, maybe Cisco is even one of those, um, that, you know, maybe they should start participating in giving back. Um, but people are figuring that out and they're getting that message. So, um, yeah, it's really nice and it's fun 'cause, uh, I'm doing the same thing I would be doing anyway. I've been a Linux desktop user for a long time and it's painful. So it's nice to get paid to fix all that stuff.

Joey (27:59):

I think, like, I wish a lot more communities, including, I guess, the formal methods community, which is the one we play the heaviest part in, could learn from this a bit. And I think at Galois, we've started seeing this even internally. We've historically had a lot of siloed projects, and all of a sudden things are starting to meet in the middle. And there's a lot of friction there. Right. As you mentioned, sometimes you're doing design by committee when you just want to implement the thing. Um, and you go through that pain, but then, like, the next day you see, like, oh, you know, I wanted this new feature. I can't afford it on my project, but great news, somebody else was doing it anyways. And I just got that. Right. Like, I didn't have to lift a finger. And I would love to see that in that community, across companies as well. Like, I think it's a real win for a lot of communities, and Linux feels like the standout example.

Shpat (28:54):

So I want to change, uh, gears a little bit. You mentioned APIs earlier. Um, and I suspect that if I'm an engineer developing conventional applications that run on software systems, that word means something very different than what it means to you, developing, essentially, or, like, hacking on the Linux kernel. I'm curious what those differences are. Um, and then I'm curious if there is something for those two groups to learn from each other.

Tycho (29:25):

Yeah. A fair question. Um, I think, in some sense, it's the same thing. It just looks different. So the API for, like, a Linux system call, for example, is you put some, you know, depending on your architecture, you either put some information in registers

Tycho (29:42):

And then some more information on the stack, or it's all on the stack or whatever. Um, and, uh, then you issue an int 0x80 or whatever, or the syscall instruction or whatever your thing is, to trap into the kernel. And the kernel reads information out of the registers, out of, uh, your stack or your heap or wherever, and then proceeds to operate on it. So, you know, that looks a lot different from, you know, I HTTP POST this JSON blob to this endpoint and then something happens and then I get this result back.

Tycho (30:11):

adding system calls to do things. So the mechanism I just described was the way that you added new functionality to the Linux kernel. Then for a while, I think, for whatever reason, it became very out of vogue, or impolite, to add new system calls. And so people put a lot of information and various things in /proc. So for example, if you're turning on a Linux security module, an LSM like SELinux or AppArmor or something like that, you actually do that by writing to a proc file. There's no system call for that. Or if you're configuring a user namespace, if you're trying to set up the ID mappings, you do that by writing to some file in /proc. So those are API calls, but they're really writes to a file, and they're not system calls.
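Editor's sketch: the user namespace ID mappings mentioned above live in `/proc/<pid>/uid_map` and `/proc/<pid>/gid_map`, where each line holds three integers: the first ID inside the namespace, the first ID outside it, and the length of the range. The helper names below are hypothetical and only show that the "API" here is a small text format written to a file.

```python
from typing import List, NamedTuple


class IdMapping(NamedTuple):
    inside_id: int   # first ID as seen inside the user namespace
    outside_id: int  # first ID as seen in the parent namespace
    length: int      # number of consecutive IDs covered by this mapping


def parse_id_map(text: str) -> List[IdMapping]:
    """Parse the contents of a uid_map or gid_map file."""
    return [
        IdMapping(*map(int, line.split()))
        for line in text.splitlines()
        if line.strip()
    ]


def format_id_map(mappings: List[IdMapping]) -> str:
    """Render mappings in the line format the kernel expects to be written."""
    return "".join(f"{m.inside_id} {m.outside_id} {m.length}\n" for m in mappings)


if __name__ == "__main__":
    # Reading the current process's own mapping is harmless; writing one is
    # only permitted under specific conditions and is not attempted here.
    try:
        with open("/proc/self/uid_map") as f:
            print(parse_id_map(f.read()))
    except OSError:
        print("no /proc/self/uid_map on this system")
```

A container runtime would write the output of something like `format_id_map` to the child's `uid_map` file after creating the namespace; that write is the API call.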

Tycho (31:09):

And then more recently we've kind of gone back, in Linux, to the system call model, for a variety of reasons that are, I don't know, more specific to Linux than anything else. But for the most recent rash of syscalls that people have been implementing, there's a document, I think from maybe two years ago, that was put into the kernel documentation tree about what they call extensible argument syscalls. It's a way to design system calls so that you add elements to the end of the structure. And it talks about how things should behave under various conditions: if your kernel's newer, if your user space is newer, blah, blah, blah. But the core of that really is versioning. As the kernel adds new functionality, it basically wants to version the API.
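Editor's sketch: a small user-space model of the size negotiation behind extensible argument structs, as the editor understands the kernel documentation. User space passes the struct together with its size. If the caller's struct is shorter than the kernel's, the missing fields are treated as zero. If it is longer, the extra bytes must all be zero, otherwise the call fails with E2BIG. The struct layout and field names (`flags`, `mode`) are hypothetical.

```python
import errno
import struct

V1 = struct.Struct("<Q")    # hypothetical first version: flags
V2 = struct.Struct("<QQ")   # hypothetical second version: flags, mode


def copy_struct(user_bytes: bytes, kernel_size: int) -> bytes:
    """Model how a kernel might accept a struct of a different size than its own."""
    user_size = len(user_bytes)
    if user_size > kernel_size:
        # Newer user space on an older kernel: unknown trailing fields are
        # acceptable only if the caller left them all zero.
        if any(user_bytes[kernel_size:]):
            raise OSError(errno.E2BIG, "non-zero fields unknown to this kernel")
        return user_bytes[:kernel_size]
    # Older user space on a newer kernel: fields it does not know about
    # default to zero.
    return user_bytes + bytes(kernel_size - user_size)


if __name__ == "__main__":
    # Old program, new kernel: mode silently defaults to 0.
    print(V2.unpack(copy_struct(V1.pack(7), V2.size)))
```

The point of the convention is that zero always means "behave as before", so old and new binaries keep working against old and new kernels without an explicit version number.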

Tycho (32:00):

And this extensible argument mechanism is sort of a fancy way for us to do API versioning with just very basic memory passing and stuff. But the key takeaway, I guess, is that if you have an API, you should have a way to version it. And I think people who are doing JSON APIs over HTTP can take that away too. If you have a config file, just stick a "version 1" at the top, just some way for you to add or grow or change the semantics of that without totally breaking all of your users. I guess that's kind of the number one takeaway from the Linux kernel API, and it's best to do that from the start. Even when you're shipping your first 0.01 version, if you put the version there and require it, you're going to be happy later on. It sounds like, yes. And users of Qtile who are listening to this will be very angry with me, because we often break the API in Qtile. Sorry.
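Editor's sketch: the "stick a version at the top and require it" advice, applied to a JSON config file. The field names (`version`, `host`, `hosts`) and the migration step are hypothetical, chosen only to show how a required version field lets the format change without breaking existing users.

```python
import json

SUPPORTED_VERSIONS = {1, 2}


def load_config(text: str) -> dict:
    """Load a config, requiring an explicit version and upgrading old ones."""
    data = json.loads(text)
    version = data.get("version")
    if version is None:
        # Requiring the field from day one is what makes later changes safe.
        raise ValueError("config is missing the required 'version' field")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported config version {version}")
    if version == 1:
        # Hypothetical change: the single v1 "host" became a v2 "hosts" list.
        data = {"version": 2, "hosts": [data["host"]]}
    return data


if __name__ == "__main__":
    print(load_config('{"version": 1, "host": "example.org"}'))
```

Old files keep loading because the loader knows exactly which semantics they were written against.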

Shpat (33:03):

A "do as I say, not as I do" kind of thing. Yeah. Cool, thank you for that. Another thing that's on my mind: I want to go back to kernel CI, and finding things that are wrong and finding bugs in new code. My understanding is that there are some tools and approaches that have produced a lot of really good results. I'm curious what those are.

Tycho (33:32):

Yeah. So for CI, the story is really sort of kselftests, xfstests to a certain extent, and tools like fio for performance measuring on block devices and file systems. Those are good tools for catching regressions and for testing new implementations of a file system or whatever, for performance aspects. But any talk about kernel testing would be remiss if it did not mention, I think, the work that's been going on at Google on syzkaller, S-Y-Z-K-A-L-L-E-R, which is kind of a pretty basic fuzzer for the kernel. It's grown a lot now and it's a lot more advanced. There's a whole DSL, a domain-specific language, for designing interleaved system calls and stuff.

Tycho (34:27):

So you can get the kernel into particular states. But syzkaller has found lots of bugs. It's found lots of bugs in code I've written, it's found lots of bugs in code other people have written, it's found lots of very security-critical information leaks and stuff, just behavioral bugs, kind of everything. The guy who leads the project is Dmitry Vyukov at Google, and he's done an amazing amount of work that's very good, and they've found a lot of stuff. In particular, they have a bug tracker that's open, so if you want to fix some of the more basic bugs in the Linux kernel, that's a great place to start. It's quite an amazing tool. And I think it really has made kernel developers pay more attention to that kind of thing: just validating inputs, making sure that we don't have these sort of very basic copy_from_user, copy_to_user kind of length errors, that kind of thing, because they know they'll get a million reports from syzkaller if they screw that up.

Tycho (35:29):

And people kind of joke about how annoying it is that it's sending all these emails, but all these emails are real bugs. It doesn't always have a reproducer for how it got there, but on most emails there's a C program attached, and if you run this C program, it will cause a kernel splat. So it's very hard to ignore.
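Editor's sketch: a toy version of the loop described above, in which a fuzzer throws random inputs at a target and, on a crash, hands back something a developer can run directly. This is not how syzkaller is implemented. The target function and its deliberately unchecked length field are hypothetical, loosely modeled on the length errors mentioned in the conversation.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Toy target: the first byte declares the payload length."""
    if not data:
        return b""
    length = data[0]
    payload = data[1:]
    # Deliberate bug: the declared length is trusted instead of being
    # validated against len(payload).
    return bytes([payload[i] for i in range(length)])


def fuzz(target, iterations=1000, seed=0):
    """Return the first random input that makes target raise, or None."""
    rng = random.Random(seed)
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        try:
            target(data)
        except Exception:
            return data
    return None


def reproducer(data: bytes) -> str:
    """Render a crashing input as a snippet someone can paste and run."""
    return f"parse_record({data!r})"


if __name__ == "__main__":
    crash = fuzz(parse_record)
    if crash is not None:
        print("reproducer:", reproducer(crash))
```

The report is hard to ignore for the same reason the emails are: it arrives as a concrete input, not a description.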

Shpat (35:49):

Yeah. I'm curious what your thoughts are on what makes it such a successful tool. Is it just because it's there, at the end of the day?

Tycho (35:58):

I think so. I mean, I'm sure part of it is just that Google has put a lot of resources, in terms of compute power, behind running lots of instances to find all these bugs, but they also hired the team to build the tool. I think there are other companies now contributing to it too, but I think it's primarily driven by Google. So yeah, I think just being there. And they're doing lots of interesting things too. They're not just fuzzing at the syscall level anymore. They're doing USB device drivers: what happens if I write a bunch of random junk to this USB device, can I make it crash in a certain way? And these are real bugs that people could exploit. If I can get root access on your machine just by plugging in a bad USB stick, that might be a problem. So they've been doing a lot of, I think, very good work.

Shpat (36:52):

It also sounds like the fact that it provides an actual runnable program that you can then play with as you try to fix something might be, or at least to me, naively, sounds like it might be, a factor in helping you engage with it.

Tycho (37:09):

For sure. At least from my perspective, when it finds bugs in my code, often it will send this C reproducer and say, "run this," and that makes it very easy to just look at the race or whatever. It will sometimes take a little while, because it'll have two threads and they're trying to do something very fast, but it will eventually reproduce. And that is, I would say, very, very good.

Shpat (37:39):

Yeah. And that makes you think about what user experience, UX, means for, you know, static analysis tools, or just tools like this that are for developers, because that sounds to me like a really good UX choice.

Tycho (37:55):

Yeah. If you think about any program that has an API, even if you have a JSON API, if you're fuzzing your JSON API, it could generate a curl command that says, "Hey, when I paste this curl command against your thing, bad things happen," or whatever. Basically it's "here's an input that makes your program crash." That program is, you know, the Linux kernel or whatever, but it's a similar thing. Yeah, it's a great UX. Do you think there's work to be done still in the automated testing world? You mentioned it's a pretty straightforward fuzzer. I mean,

Joey (38:36):

say, if you're in the automated testing world, is there a call to action, where you could maybe crash the kernel in some high-impact way and find a really impactful bug if you apply your tools in this space?

Tycho (38:48):

Yeah. So I think there are probably a couple of different ways in which there could be more automated testing, or better automated testing. Maybe the first is, I mean, it's a fuzzer, and they have a domain-specific language so they can explore the call graph a little bit deeper, but I think it's still a pretty shallow exploration of things. For example, in order to set up a container, you have to do a series of syscalls, write all the right information to these proc files, and do a series of more syscalls, and syzkaller's just not going to be able to explore that very well. So that's one of those things where there are however many container runtimes, and all of them implement some version of this series of steps, and surely some of them must get it wrong, and we don't have any real way to check that.

Tycho (39:38):

But also, on the kernel side right now there's KMSAN, which is the Kernel Memory Sanitizer, and KTSAN, the Kernel Thread Sanitizer. So there are a lot of these error-checking things, like "is the kernel in a bad state?", on the kernel side, and implementing more of those types of ideas would help. syzkaller turns all of those things on, and if any one of them triggers a bug, then they send you the reproducer and blah, blah, blah. So if you can upstream something like that into the Linux kernel, you don't actually have to write any more test cases, because syzkaller is already running them all. You just email Dmitry Vyukov, or probably you don't even have to email him, he's picking up on this stuff. As soon as it lands upstream, he'll just turn it on, you know, =y in his kernel config, and then whatever bugs your thing catches will all of a sudden be caught by all of the syzkaller instances that everyone's running. So I think there are a few different avenues. If you can say, "Oh, well, we need to do this runtime instrumentation, but we can catch this class of bugs that nobody knows about right now," that's a useful thing too. And there's already a whole bunch of dynamic testing going on, so if you just implement that thing, that is very useful.
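Editor's sketch: an analogy, in plain Python, for why a new runtime checker pays off as soon as an existing fuzzer runs with it enabled. The kernel sanitizers work through compiler instrumentation, which this does not model. Here the "silent" bug is a negative index that Python happily wraps around, and the hypothetical `CheckedBuffer` is the instrumentation that turns it into a visible failure. The same random test loop finds nothing without the checker and finds the bug with it.

```python
import random


class CheckedBuffer:
    """Wraps bytes and rejects indices that plain bytes would silently wrap."""

    def __init__(self, data):
        self._data = bytes(data)

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < len(self._data):
            raise IndexError(f"out-of-bounds read at index {index}")
        return self._data[index]


def byte_before(buf, pos: int) -> int:
    """Toy target: read the byte just before pos. Bug: pos == 0 is unhandled."""
    return buf[pos - 1]


def count_crashes(make_buffer, runs=200, seed=0) -> int:
    """Run the same random workload and count how often the target raises."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(runs):
        buf = make_buffer(b"abcd")
        try:
            byte_before(buf, rng.randrange(4))
        except IndexError:
            crashes += 1
    return crashes


if __name__ == "__main__":
    print("plain bytes:   ", count_crashes(bytes))
    print("instrumented:  ", count_crashes(CheckedBuffer))
```

No new test cases were written between the two runs; only the checker changed, which is the leverage described above.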

Joey (40:52):

So if you are an automated testing researcher, I encourage you not to do that, because I want to do it. So please don't take those projects and run with them, because I'm honestly pretty excited about the idea of doing that, and it sounds like a great place to apply these techniques. Well, we can take a page from the Linux community and maybe do it together. And the idea of broadening the impact of the current tools, whenever possible, is definitely the right way to do things, rather than building new tools. It makes adoption easier. People are already very comfortable, it sounds like, receiving those emails. So it sounds like anybody going after these problems would certainly be best served by understanding how syzkaller works and whether it could be extended in the direction that they're taking it, rather than trying to build their own thing that will result in one more form of slightly annoying but respected email for kernel developers, basically. Yeah.

Shpat (41:57):

Well, this was a fantastic conversation. Thank you very much for joining us and for chatting with us about all this. I had no idea what really goes on these days in kernel space, so this is awesome.

Tycho (42:09):

Yeah. Thanks. Thank you guys for having me. Absolutely.

Shpat (42:12):

Yeah. Well, this has been another episode of Building Better Systems, with Tycho Andersen, and we'll see everybody next time.