>> Unknown: Alright, welcome back to CS50. This is the start of week 7. For those of you that have been playing along at home, there was no week 6 on film at least because of last week's quiz. And now that quiz 0 is in fact behind us, I dare say that you are now qualified to either laugh with or laugh at a little something like this that went around the TF's list this weekend.
[ Laughter ]
[ Music ]
>> Unknown: Hey.
[ Background noise ] Mom, Dad, this is Ryan.
>> Unknown: Oh, so this is the famous boy we've been hearing so much about.
>> Ryan: Nice to finally meet you Missus Lead and sir.
>> Unknown: So, Ryan, what kind of server do you play on?
>> Ryan: Is that on, is that on a computer?
[ Laughter ] You know what? I think I forgot the dessert in the car.
[ Background noise ]
>> Unknown: Who do you think you are bringing his kind into my house?
>> Unknown: But dad, he's a nice guy really.
>> Unknown: No, he's a newb Jessica, a newb.
>> Jessica: But Ryan and I are perfect together.
>> Missus Lead: This relationship is headed for an epic fail young lady.
>> Unknown: You're elite damn it, we don't take newbs, we prone them.
>> Jessica: Well maybe I don't like to be elite.
>> Unknown: You're insolence, for a lose.
>> Jessica: No, maybe I love that he watches VHS tapes still and maybe I love that his phone still has a cord.
>> Missus Lead: You might as well date somebody who plays Aliant.
>> Jessica: I'll take whoever I want.
>> Unknown: Over my level 80 Roadster Pirelli dead body.
>> Ryan: Mister Lead. I may run around in circles when I play Halo and I might never get a monster kill. I can't even find the space bar half the time. But I know when I found true love. And that is worth more than all the uber gear in the world.
[ Music ]
>> Unknown: Too long, did not listen.
[ Laughter ]
[ Background noise ]
[ Laughter ]
>> Unknown: It's like the fifth time I've watched that.
[ Laughter ] So a couple of announcements, problem set 5 went out the door on Friday. This recall is the forensic P set which involves you're not only recovering a secret hidden message but also recovering some photographs that we accidentally deleted. And to add to the fun you'll find that once you've recovered these photos, they are all of persons, places or things on campus and the goal ultimately after the P set is submitted is if you can with your section mates find any or all of those photographs and then photograph yourself or someone else in your section in that same place or with that same person or thing, happy cat excluded. We will award you amazing well in some form later this semester. So you have a few weeks for that scavenger hunt portion. You'll notice on the course's website too that we've been soliciting beta testers for cloud.CS50.net. So CS50 has some super fancy hardware that we bought over the course of the summer that's taken us a little longer to configure it than we would have liked. But finally we debuted this weekend and the cloud.CS50.net is essentially a cluster of servers running Linux, virtual servers actually, virtual machines. More on that in the future. That you all will very soon have accounts on. This is so that we can leave nice.fas and some of its shortcomings behind us and operate our own infrastructure entirely. And this will be particularly useful for you come final projects time because inevitably many of you will go off and want to use a different language or a different piece of software. And because we'll finally be in our own infrastructure, we can just type a few magical commands and wala, you will have with higher probability what you need. Because this P set is related to forensics, I actually smiled when I happened to cross this this morning. A little something stupid as well for you. Oops.
[ Laughter ] Ok, a few chuckles, alright. So that's from the popular website Icanhascheeseburger.com which we actually used to integrate into the course's own website. We used to have a wall catch of the day. Based on Q guide evaluations and a recent comment made in computer science 61 for those familiar, we decided to remove Wall Cats slightly from the course this semester, but some of you might enjoy wasting some time there. It's linked on the course's website. Alright, so final project. So it might be a little early, but actually now is pretty much a really good time to start thinking about final projects. And thinking about projects that even if you at this point in time have no idea how you, little old you would actually implement something, that's fine because there's much more to come in the weeks to come including a look at web programming and some of the languages, the technologies and the concepts underlying web based applications. You will find on the course's website a few relatively new links. For one, there is this link called seminars. And all of this is linked right now in the course's homepage. What the teaching fellows have done historically and will continue doing this year is offer a number of seminars on topics related to but nonetheless somewhat tangential to CS50 on material that we just haven't the time or the syllabus for, so that if you're interested in learning about developing a voice based application like Shuttle Boy Voice, well we'll have the infrastructure and the tools to make that possible for you and we'll offer a 60 to 90 minute seminar in early November led by one of the course assistants on how to do exactly that. We'll talk about Rails which is another programming environment, iPhone development, Android phone development. And we actually have access to some pretty neat stuff. So those of you who might like to tackle an iPhone app, you might be vaguely familiar that Apple actually charges people typically to download the tools necessary for such, 99 dollars here, 299 dollars there. Well you won't need to pay any such things. We have our own CS50 account that we can make you part of. And also Google has very generously donated a number of Android cell phones to the course. Some of the teaching fellows are actually carrying this around already and have been playing with Google's API, application programming interface with which they've been writing their own software for these phones. And we'll demo at least one of those apps before long. But if you are a student in this course and are already on TMobile ideally or at least AT and T, both of whom uses GSM networks. And you would like to tackle an android based project, a phone based project using one of Google's new phones, we can provide the phone. And if this is something that you see through to throughition, you'll be welcome to keep the phone as well. So more on that in the time to come. And you'll also see that we have built up an archive over the past couple of years of seminars past, all of which were videotaped and made available online. So even if we're not offering something this year on some topic, realize that there's a couple of dozen of seminars from years past. And there's probably going to be some more to come. And also, I keep using this word API. And even though we've not used APIs at least public ones just yet in the course, we will for P sets 7 and 8. But an application programming interface is essentially, think of it kind of like a header file almost, a dot H file that has a bunch of functions declared in it that you can call inside of your own program. But you didn't have to implement the functionality defined in that API. So in P set 8 for instance, you guys will implement a mash up using Google Maps. Well you're not going to have to invent that from the ground up. You will be able to use Google's tiles and graphics for various maps around the world. You'll be able to use their JAVA script code that allows you to drop little markers on top of maps and pan from location to location. So you'll have a lot of functionality just handed to you in the form of an API or perhaps a library which just means Google has written a lot of functions or methods that you can call in your own code. And then actually wire things up in a more interesting way. So what we've started doing on this page here, which is also linked on our homepage now, is we've just started linking what we think are some fun APIs. And we don't specify on this page what you might want to do with them. The hope is that you'll skim this list at some point and just say oh wow, I had no idea I could do this with Facebook or I had no idea I could do this with Google or Twitter. And then hopefully it will start to instigate some ideas in your own mind. So we'll release formal specs for the final project in a few days time. By next week when we look at, start looking at web programming. But again the hope now is just to start fostering some thought. And it's funny, you've seen on CS50's website that we've been dabbling internally with various apps for events and news and shuttles and all of this. Partly just because we can and hopefully because people find it useful. But we actually had a staff member at Harvard email me the other day saying hey I've been using Harvard News. It'd be great if you did something for Harvard Twit, Twitters, so people who use Twitter, the Twitter audi. So I up until a few weeks ago had never let myself say words like Tweet and similar words, but I've kind of sunk to that now. I don't yet do it myself. Although I have a lot of followers on Twitter because I signed up like a year ago for a Twitter account because we were playing around with CS50 ideas. So apparently people have just been subscribing to mail in it post for the past 12 months. I've never said one thing. I have a lot of subscribers. Maybe we'll start doing something with it at some point. But in any case, I was inspired. So I was away at a workshop last week and I had some down time in the hotel room late at night. And I figured already, maybe this could be interesting to do something with Tweets. And so I figured, you know I vaguely knew what Twitter was and how it works and I figured, alright, they're pretty hip, they're pretty huge now. Maybe they have an API, some way of me interfacing with their data and with people's tweets and all of this so that I could present it maybe on a Harvard specific basis in my own site. So I Googled something like Twitter API, hit enter and wala, Twitter API Wiki or here's the documentation. So I spent some time that night just kind of reading through this. And a lot of this frankly I didn't really need. You'll find that even within Google's APIs, there's a lot of functionality you don't care about. So it need not be overwhelming but rather kind of exciting what you can do with this. And I found ultimately that I could serve through tweets that had been posted around the world. I found that I could search for a specific user names and get their icons and get their actual names and URLs and so forth. And so you know a few hours later that night and then a few hours later the next night, once I had touched things up, thus was born Harvard Tweets. So now we have another stupid little website that simply aggregates tweets from people who know about all over campus, whether person or department or group. And what we did was write a few scripts behind the scenes that synchronizes this site here, Tweets.CS50.net with Twitter's own dataset every 5 or so minutes. So it's actually fascinating since that hotel room last week, Harvard affiliates have tweeted 2344 times. And that number grows by the minute. This was kind of my innovation last night. And innovation is generous since I didn't really do anything here. But I found this on someone's blog just randomly. And I was like oh that's a really cool tag cloud. It's actually a cloud or a sphere. So this is actually a flash animation but this guy too had open sourced his project. He made available essentially what was an API for putting your own words into a 3D cloud like that. And so what these are, these little hash tags in Twitter speak that are simply key words that you can then search on. So if I go over here and I'm curious about people talking about baseball, I can click that. And all of the sudden I get all the recent tweets about baseball, people that have mentioned it. So.
[ Laughter ] You would be amazed, ok. So.
[ Laughter ] Alright, so this is ridiculous. You should see how many posts come from Harvard FML. This is since like last Tuesday night in my hotel room. So you guys are out of control.
[ Laughter ] So, anyhow, at some point soon we might need a new filter of sorts because if you start skimming through this morning alone, I think it must depend on whoever's running the site when they actually wake up and start approving things. There it comes, so that's last night. I mean it starts to overwhelm. But the point here is I mean partly just to distract perhaps. But also it's actually really just neat what you can do. So now you guys have like half a semester behind you essentially and you have some programming chopped. You have hopefully conceptual understanding of how the world works, at least technologically. And there's again much more to come, but doing something like this is relatively now accessible. And the hope for a lot of you, especially those dubbed less comfortable, is that you'll be able to do come term's end for final projects things that you had no sense of your own ability to do at the start of the semester. So hopefully a number of you will amaze yourselves. And you'll find at the end of the semester that the goal is to exhibit precisely those accomplishments in the form of the CS50 fair for which we have several dozen videos from last year, hosted on Facebook. And it was actually a lot of fun. And we will host a similar event for you and your friends this year. Any questions, logistical or otherwise?
[ Background noise ] Alright, so last note here. So dinner, this Wednesday, we had lunch last Friday, we'll do dinner this Wednesday hosted by Matherhouse and also in attendance will be Professional Rodica Nagpaul who teaches a number of course at SCAS including the course dubbed robo soccer. So last spring I had the pleasure of seeing some former CS50 students and teaching fellows compete in robo soccer, whereby they'd spent the entire semester using their computer science and engineering background. And to program these little robotic devices that were supposed to play soccer. And it was really neat, up in the Quad in Hillis Library, they had their little soccer field set up. They had cameras mounted in the ceiling and the cameras then looked down onto the field to figure out where the robots were in order to determine what algorithm should get applied to a particular robot at a given moment in time, given the locations of other robots, given the location of the ball, given the location of the goal. So Rodica will be joining us this Wednesday largely for a casual chat. She'll talk about her work and that kind of stuff. And then it's also an opportunity to chat with me and the teaching fellows and course assistants. So just head to CS50.net/RSVP if you are both hungry and interested in getting got know some of the team. Alright, so as much as excited as I tried to sound at the very start of this semester about four loops and vial loops and do while loops and all of this stuff, so we were kind of misleading. Like none of that stuff is really all that exciting. But we had to get you through it. But it's now after quiz 0 that we can finally dive into what are some more technical challenges. And also some more sophisticated solutions. So up until now, a lot of the problems you've been solving have evolved using you know an array or maybe starting last week thinking about things like link lists. But using those same basis, arrays and linked lists, we'll see this week can you do a lot more sophisticated approaches to similar problems. And we'll talk this week about compress. We'll talk this week about bit wise operations, manipulating bits at the very lowest levels, trees and things called tries and heaps. So data structure that you can continue a discussion of in a course like CS124, but generally that will empower you to solve actual interesting real world problems with a much richer tool kit under your belt. So first a practical tool. So it's been very annoying for you I'm sure to wage battle against seg fault and other memory related errors. And usually if you've triggered a seg fault you've done something per quiz 0 like step over the boundary of an array and touch memory that you don't own. Or you take a pointer and you try accidentally dereferencing it. Or you take null and worse yet you try dereferencing it. So in short, any time you guys have done something wrong, unintentionally with memory, have you possibly triggered a seg fault. Not always, but sometimes. It turns out there's help besides your own eyes and besides the teaching staff. So there are tools like this one, Val Grind which is a Linux or a base tool that you can now run much like GDB on your own binaries. Your own executable code and it will do its best to analyze your program to see did you screw up anywhere, at least with regard to memory. Might you trigger a seg fault? Did you allocate memory that you did not free? So do you have memory leaks? So what I did was I whipped up a little program here, you should have a print out of it today, hand outs were outside, a bit belatedly on the way in. If you didn't get them, there will be some there on during break. But it's a very small program and you can see it here. So I have a main routine that just calls a function called F and then it returns. But what does F do? Well I do an allocation of memory and then I do an assignment. So there's at least two bugs in this code. This is memory.c and whether you have it on paper or just in front of you, can someone identify one of the bugs or mistakes in this program?
[ Background noise ] Say again?
[ Background noise ] Ok so we don't free the malock, so the top of the function F I call mallock and I allocation 10 times 4 bytes, so 40 bytes assuming a 4 byte integer here. So I allocate 40 bytes dynamically and I retain that address, but then I never in fact call freeze. So that's arguably one bug in here. Now granted and to be clear, when this program quits, generally the operating system should take back the memory that was allocated for this program. So realize that even though with little baby programs like this, it might seem lame or unnecessary to bother freeing memory. Very rarely in life will your programs be as simple or as short as this. So this habit of freeing memory that you allocate is indeed important. So that's one bug there. So I haven't freed my memory. And there's another bug, maybe a little more subtle but even worse.
>> Unknown: [Inaudible] the 10th element of array X.
>> Unknown: Ok.
>> Unknown: 0 through 9 element.
>> Unknown: Ok, good. So and let me tweak your jargon. So I don't allocate but I assigned the 10th location or the 11th location of X even though there are only 10 locations there. So because it's 0 indexed, I can go from X bracket 0 through X bracket 9, but 10 is not good. Now not necessarily will this trigger a seg fault, which is why especially C and C plus, plus, some of these bugs are hard to chase down, because the word segmentation fault, the phrase segmentation fault actually hints at what's real, slightly more complicated structure underneath the hood which is that your computer's RAM is organized into different segments and it's really only once you cross segment boundaries that bad things happen. So realize that this might not trigger a seg fault but fortunately can you the human, or we the staff or programs like the one I'm about to demonstrate actually detect even mistakes like that. Because there are situations certainly in which bad things can happen like your program dumping core. So let me do this. Make memory. And that's going to execute the long GCC command. And I'm going to now run valgrand on memory and unfortunately its output is going to be kind of crazy, alright. A little overwhelming at first but let's just glance at the bottom on up, leak summary. So I definitely lost 40 bytes in one block. Alright, so that alone is a pretty clear warning. I have allocated memory that I did not free. It was even able to count that amount. Unfortunately it consists with what I predicted earlier when you pointed out the memory leak. So there's a problem there and then possibly loss, still reachable, suppressive. So this actually looks ok and you know what? Let me take its advice. I'm going to rerun the program with this switch as it's called, this command line argument often in Linux or UNIX or Mac OS when you want to tweak the behavior of a program, yes you can pass in command line arguments. But if they're sort of configuration options, those arguments will often begin with a single slash or a slash, sorry a single hyphen or two hyphens. This is simply convention. So let me do as it suggests and rerun it like this. Alright, so it looks the same at the bottom but I think I got a little more error, a little more reporting up top. So error summary, 14 errors, wow. I don't even have 14 lines of code. So something's really wrong. So let me scroll up and as is the case with GDB, don't focus your intention entirely on the bottom. Don't solve problems from the ground up. Go back to the very first message, because remember there's sometimes a cascading effect where a problem here can actually create the appearance of problems here even if there are none. So let's start at the top. Here's where I executed it. So mem check, a memory error detector, that's indeed what I'm running here. Some copyright information, which is not that useful. This is you know interesting but it feels a little cryptic, so I'm going to turn a blind eye for the moment until ah, here we go. Here is mention of my program. So this is shared library. This thing here. So if it's not your code or not a file you're familiar with, probably not a mistake. In almost certainly not your fault, but let's look here. Invalid write of size 4. Alright, so you can probably guess where this is coming from. So Valgrand is telling me in memory.C, line 2, I'm doing an invalid write of size 4. That is I'm probably trying to write 4 bytes or one integer to a location that it doesn't belong. And let's see, oh ok, so that is the result of having called F, the function F in line 29 of memory.c. So it's the output is similar in spirit to what GDB can do for you. And then let's see here. Alright so this is not good. Let's see, address 0 bytes after a block of size 40 allocated. So I'm calling mallock apparently in line 21 of memory.c, right inside of my F function. So it's not being as clear perhaps as to what the problem is. It's referring to an allocation of memory. But again per the summary down at the bottom, I have definitely lost 40 bytes because of my allocation in this line. So let's see if I can fix this. Let me go ahead and load memory.c again. Alright, so let's fix this to do this. So let's change this to 9, that's probably safe. And let me go ahead and recompile. Alright and now let me rerun Valgrand. Oops, not vim. Let's rerun the memory check. Alright, so good, leak summary definitely lost 40 bytes but if I look up here that mention of writing 4 bytes is no longer there. So that's good and let's fix this other bug. So let me go in here and after I've allocated X, how do I go about freeing the memory pointed at by X?
[ Period of silence ] Anyone. Free X, very difficult, alright, so. Notice you free the pointer, don't dereference the pointer. So let me recompile make memory. Alright, nothing's bad yet so let me rerun the memory check and ok, interesting. Error summary, 13 errors from 6 contexts but and here's where sometimes you should not be misled by the program because it's really doing a very diligent, very anal check of all of the code related to your program. But recall that almost any time you write a C program, you are linking in other people's code, which isn't necessarily buggy but maybe is in fact giving tools like this a little bit of cause for concern. But if I now look through this, I actually don't see any errors that seem to be mine. Mallock free, I did one allocation, one free, 40 bytes allocated. So in fact, no leaks are possible. So this is good. Down here, all heap blocks are free, no leaks are possible and let's do the more succinct one where we don't actually do this command line argument. And in fact none of these seem to be mine. Ok, so you might enjoy or you might be disappointed to go back into say p set 4 which didn't use pointers all that much to run this program on your own code, especially if you did actually use something like Mallock and or free or the lack thereof. And certainly for problem set 6 which will be our last problem set based in C for which you implement that fastest spell checker possible, will you absolutely want to check whether or not you are risking memory errors. Because problem set 6 is the culmination of all these discussions we've been having about data structures and pointers. And it's dare say the most dangerous one with regard to memory, but also very easily solved especially with the help of tools like this. So more on that in the weeks to come. So this was as cute as I could be late at night with the slide. So hexadecimal is something we've talked about already and it's not all that complicated even though it might look a little cryptic and we talked about it in the context of Photoshop and you'll see it again when we do web stuff for colors and all of that. But you'll particularly see it this week because some of the tools you'll be using forensically or to get your program working for problem set 5 is, I forget how I started that sentence, one of the things you'll be doing this week is looking at hexadecimal values. Because of some of the tools you'll be using forensically to debug your code and develop your code. So you'll use a program for instance called xxd which simply dumps the contents of a file, like a JPEG or a BMP file or even your own binary and shows you the bits but it doesn't show you them as zeros and ones which is beyond useless for a typical human. It will instead show you things more succinctly in a more interesting way using hexadecimal. But we bring this up now because you'll start to realize that some details related to files on computers are very much operating system and or CPU specific. Some computers store data differently on disk than they do on say another computer altogether. So what do we mean by this? Well, there is this notion of endianness in the world of computing that refers to how numbesr are stored in memory, RAM, or perhaps on disk, depending on who is writing to disk. So what do I mean by this? Well, let me open up a silly little text editor here. And if I have the hexadecimal number like 1 and I store something like this, int X gets 1. Well in let's see decimal that's just the number 1. In binary it's the number 1, 2, 3, 4, 5, 6, 7, 8 and let me go ahead and copy this. It's going to be 1, 2, 3 and change this and now I'll shrink this down. So that's in binary the number and this is why humans tend not to use binary as a representation system. It's not all that useful. But in hexadecimal how would I write this same value? So that's in binary and incidentally, notionally some people will often add a suffix of lower case b just to indicate this is a binary number. How would I write this same thing in hex? 0X what?
[ Period of silence ] 01 because each of these digits happens to represent how many bits? So 8, so each pair represents 8 because each individual digit represents 4 bits and why was this? Just to go back to a couple weeks ago when we started talking about hexadecimal and we said that each digit 0 through 9 and A through F represented 4 bits, either some pattern of 0s and 1s. So this was the decimal number 0 and 11111 is the hexadecimal number or the decimal number what?
[ Period of silence ] 15 or in hexadecimal we would write this as an F. So here in the left hand column I'll write this all out. So in the left hand column here I have binary and then I have decimal, decimal and then here I have hexadecimal. Ok, so those are my three little contrived columns there. So why is this relevant? Well, this number here, just the number one, it turns out when it is stored in RAM on a typical computer including NICE.fas.harvard.edu it's actually not stored in this format but in the little NDN format. So there's this notion of big Ns and little Ns to numbers. So here is a number, granted in hexadecimal. The little n is the lowest ordered bits. So the tiniest numbers right? Because if I put a one at the other side, this actually makes this a pretty big number. But because the number here is at the right hand side, this is the little n, this is why this is such a small value. So the big N is over here. So this is very nice and neat because when we've been reading binary numbers from left to right, or rather from right to left we have the ones column, the twos column, the fours column and so forth, everything mathematically works out nicely. Unfortunately a typical computer would actually store this number, the number 1 as this.
[ Period of silence ] It would store it in little NDN format where the lowest ordered byte actually comes first in memory. So in other words the thing is essentially reversed in memory. Well what do we mean by that? Well we have this nice little picture which if any of this is unclear it's actually a really nice article on Wikipedia for this topic, NDN ness. Take a look at the top left box there called register. So inside of a CPU are a whole bunch of registers. These are very tiny units of memory that ultimately is where all the action happens, all of the additional, multiplication, subtraction, all of the lowest level operations in the CPU happen in these things called registers. So in that top left box there, 0A, 0B, 0C, 0D, person who made the picture just needed a contrived number that's easy to remember. That is a number just as I might write it by hand on the blackboard or on my little notepad here. So it's from the big end all the way down. So what's the case in a computer that big NDN, what happens is if the left hand thing there represents RAM, where A represents the low of, location A plus 1 represents the next location, A plus 2 is the next location. In other words the bytes are getting incremented from top to bottom in this picture. Notice that the big end of the number 0A does in fact come first in memory. It comes at the lowest ordered address which is little A. Meanwhile the lowest ordered bits which is 0D come later in memory. So in other words at least if you think like I do about numbers in memory, the real number 0A, 0B, 0C, 0D is laid out in memory exactly as you the human would expect, from left to right or in this case from top to bottom. But in a little NDN architecture like nice.fas.harvard.edu and a lot of Intel hardware, what you have instead is kind of the opposite. So if we have that same number in the registers, the same number on a piece of paper, 0A,0B, 0C, 0D, notice that it gets laid out in memory essentially backwards. Where the big end of the number 0A comes last in memory. So this is simply unfortunate because I think it adds unnecessary confusion. But the world decided long ago that some CPUs would be little NDNs, some CPUs would be big NDN and really complicated annoying things happen when two computers need to talk to one another. So thankfully the world has decided on a network NDN format. So because we have this thing called the internet now, where data transfers from points A to B is commonplace and A and B absolutely might not be the same type of computer, thankfully the world has standardized what format is actually used across the wire, so that you don't have one computer talking to another in the wrong order. So what does this mean for you? Well when you take a look at BMP files, bitmap graphics that we've given you as part of problem set 5, you will find when you look at them with this program called XXD, it will show you the contents of bitmap files in memory. You're going to see that the numbers are not the way you would expect them if you wrote down those numbers in memory. So you'll be seeing and the whole spec walks you through this, a bitmap file can be represented with a whole bunch of RGB triples, red green, blue pixels, red, green, blue triples. You'll find unfortunately that these are laid out backwards when you actually look inside. So more on that in the spec itself. And let me go ahead and open this one file here. So vi of let's say NDN.c. Notice we have this little program here. So this is a quick teaser for what you'll be playing with probably partly in problem set 5. Let me go ahead and scroll to the top. I have one main function. This is NDN.c, now you got a little anal check here to make sure I provide a command line argument and if not I just return 1. I don't even yell at the user explicitly. Here recall is the notion for opening a file in read format. So the goal of this program is simply to run it, give it a BMP file, a graphic file like a Windows wallpaper, that's all a BMP file is. And then actually look inside of it and see what some of its fields look like in the order in which numbers are laid out. So let's see, if this thing, this file pointer is null, let's return immediately because that's not a legit file. And then there's some instructions like this, F seek. So you'll see this more in the spec and I'll defer to its explanations. But F seek, file seek is a function that simply starts at a file and then fast forwards, much like an old cassette player would, to a specific location in the file. So F seek seeks two bytes forward in the file, thanks to that 2. Down here I'm doing an F read and as you'll see in the distribution code, which again is well documented, notice that F read is file read. It's going to read into a variable called BF size a number from that file, thanks to the use of FP here. So in short what this program needs to do is given the name of a BMP file, it needs to open it, it then needs to fast forward to a specific location in that file and then it wants to read out a specific chunk of memory, specifically 32 bits which we're going to call BF size. Now why is this the case? Well I can whisk us over here to problem sets and I'm going to go down to just the standard edition of the forensics P set. I'm going to scroll down to a picture I'm looking for which is this one right here, which the specifics of which aren't terribly interesting right now. But what we're ultimately looking for in this file is a specific field. So long story short, files have different parts to them. This is what makes something a file format. A company like Microsoft or someone else declares that the first few bytes will mean this, the next few bytes will mean that, and so forth. This picture as the spec explains discusses exactly what's inside of a bitmap file and where. So the goal very simply of this code, because I know what that picture looks like, is to read out an integer inside of the file that's going to tell me how large it is. How large is this actual BMP? So what do I do next? After I've read in this nt, we report that the value of this nt, BF size is what's inside this variable. Then I do this, it turns out much like a cassette tape. There's a rewind function which rewinds the file to the very beginning. And then well I'm trying to, oh this is getting so much more complicated than I want it to be. Ok and then I do some complicated stuff which, damn. I don't really, let's spoil the ending. NDN. Ok, NDN of I have a file in here called doc.bmp. Ok. Let me skip some of the details because they're not going to be enlightening, they're just going to be boring at this point in time without P set 5 under your belt. So what I have just run here is a program that analyzes a tiny little bitmap file called doc.bmp. I made this with Photoshop. I opened up a one pixel by one pixel file and then I clicked my paintbrush tool and said make this a black dot. Then I save the file as doc.bmp. That's a bit of a white lie because making one little dot like that would not make such a big file. But BF size of this file is 58. This file in total is 58 bytes long. And simply what I did at the very end here, so that very complicated piece of code which I now regret teasing you with already is to show you that in memory, the size of this file is in fact laid out in little NDN format, even though this is an nt. And I would therefore expect it to be something like zeros then some zeros then some zeros then the number 58 because there's 4 bytes to an integer. In fact I do see from left to right that the little end of this number comes first. So let me wave my hand at some of the complexity I tripped over there just to say that when you dive into problem set 5, if not already, these are precisely the encoding issues that you are going to encounter for yourself. But let's take a step back and keep things a little less overwhelming at first and just look at these building blocks. So even though thus far we've been only doing operations like addition and subtraction and multiplication and division in the context of C, there's some other useful operators as well with which you can do much more interesting things even though they're incredibly low level. So henceforth, if you ever want to take two variables, say X and Y, and you want to somehow combine their bits in an interesting way you can do that using these so called bit wise operators. Unlike the arithmetic operators like plus and minus that sort of operate on the number in the aggregate, bit wise operators when applied to this variable and this variable essentially line them up and then apply the operation one bit at a time. And there's the notion of anding bits, together oring them together, exclusively oring them together, complimenting them and also shifting left and right and by this I mean the following. So let me go ahead and just list out let's say a number X and call it 00000. For simplicity I'm not going to use integers because I'd be writing 32 digits all of the time. But let's say this Y is 00001 and suppose the operation I want to perform is X and Y. So X and Y would be represented as X and Y and this equals the result of applying the and operation to each pair of digits that line up together. So let's start from the left to right. 0 and 0. So just logically that's false and false. So what should the answer be if you're anding them together much like we did with Boolean operators weeks ago. So it should be false. So 0 and 0 be clear, 0 and 0 is in fact 0. So that's the first of four digits. The next pair, this guy here and this guy here are also zeros so 0 and 0 gives me 0. 0 and 0 gives me 0 and 0 and 1, false and true gives me 0. Alright so something that's false and true, if you and those two together, if you think of that as an ampersand ampersand operation whether it was in Scratch or it was in C. False and true gives you false as well. Meanwhile if we do this, this is the or operator. So x or y means either one bit or the other must be one. So what about 0 or 0? No. 0 or 0? No. 0 or 0? No. 0 or 1? Yes. And now this one's a little more interesting. X, X or Y, so slightly confusing name but exclusive or represented in the language C using the little carrot operator. This means that one of the bits and only one of the bits can be 1 for the result to be 1. So you need exclusivity, either one and zero or zero and one. But not 0 0 or 1 1. You need exclusivity. So 0 0, 0 X or 0, not exclusive. 0 X or 0, not exclusive. 0 X or 0, not exclusive. 0 X or 1, yes, so now I have this answer here. Now some of them are pretty darn trivial. So Tilda X which here represents the compliment. This just means to flip the bits, 0 becomes 1 and 1 becomes 0, so Tilda X is 1111, that's pretty simple. And then we had a couple of others. Let's say X left shift, so 2 angle brackets 1, let me go ahead and shrink the font a little bit. Actually let's do it to Y. So Y left shift, left shift is going to equal what? So this means take the bits of Y and shift them to the left by one place. So what does the output become?
[ Period of silence ] 001 and then what do you want to put at the left hand side? So 0. So the shift operators indeed pushes all of the bits to the left, if you're using the left shift operator or pushes all of the bits to the right if you're using the right shift operator. And it adds any empty spaces typically with zeros. Though FYI, there's some corner cases when you're using signed integers as to how you handle negative numbers in particular. But we'll focus just on simple positive numbers for now. This is kind of a funny thing and actually let's do this. Y right shift 1 is going to equal all zeros in this case because you're pushing the bit off the end of it. So this one's kind of interesting. If you shift the bits of a variable by 1 position like this, what's the equivalent mathematically?
[ Period of silence ] Yeah, so this is actually a really neat low level way of multiplying or multiplying any number by 2. So if I had the number Y equals 1 up here, shift the bits to the left, you told me that we get this. This is the value. Let's see this is columns, ones columns, two. So that's the value 2. So one becomes 2, if I left shift again by 2 places this number becomes this which is the value in decimal known as 4. Do it again and we get 8, do it again if we have enough bits, 16, 32. So a really neat way of multiplying a value by 2 is simply to shift its bits to the left. So this is one of the underlying representations. Now who cares is an interesting question now, right? So this is all fine and good but I promised that we would actually be solving real problems and not devolving back to for loops and while loops and all of this. So it turns out there's some very interesting applications of these very simple ideas. So these, how many of you just by a show of hands has ever lost data because your hard drive crashed or got corrupted or something bad happened? So a lot of you. Now how many of you didn't actually lose data because you were using a raid 1 or raid 5 set up in your machine? Alright, so not many, maybe just one or two geeks who ever awkwardly raised their hands that second time. So what does this mean? Well frankly if you have a desktop computer these days, there's no excuse for not having a pair of hard drives, two hard drives that are in a so called raid 1 configuration. So raid is redundant array of independent disks. It's just a really fancy way of saying you don't have one hard drive, you have two hard drives. And what's really neat about RAID 1 specifically and you can do this with Mac OS, you can do this with Windows these days, some computer companies like Dell offer this feature now when you use that little web based configurator when buying a computer. RAID 1 takes one disk of size you know let's say a terabyte or 500 gigabytes, then you get another identical hard drive and RAID 1 means that your computer treats them as identical. So any data that gets written here simultaneously gets written here. The upside of this is that if you are very unfortunate to suffer a hard drive failure, this guy just breaks, all of your data is literally still on the other drive. So now you're operating in kind of a dangerous mode. Your computer is smart enough, if it supports this technology called RAID to keep on plugging away using just the one hard drive. But obviously if bad things happen to this hard drive too, then your data is definitely lost. But it decreases the probability that you'll lose data because now you would have to have this drive and this drive fail simultaneously for you to actually lose data. And what's really neat about RAID is that if this drive fails and therefore is useless and this one keeps on plugging away, you can go out, buy an identically sized or larger hard drive, plug it into this slot, throw the other hard drive away and then the computer will automatically synchronize this data over to the new drive and now you're back in a stable RAID 1 configuration. So this is not related at all to bit wise operators. This is just very useful for consumers. But that only scales so well. There's other technologies like RAID 5 which actually lets you have multiple hard drives inside a computer or more commonly inside of a server that allows you to treat multiple hard drives as though they were just one. So if I had 3 1 terabyte drives, RAID 5 would actually let my computer think it had one 2 terabyte drives. So if I had one terabyte, 1 terabyte and 1 terabyte hard drive and I'm using a RAID 5 configuration, I get a total of 2 terabytes of space because one of those terabytes, one hard drive is actually used to store a check sum, which is a term that has came up very early on in the course when talking about ISBNs and credit cards for problem set 1. So if you're willing to sacrifice one hard drive which frankly is pretty reasonable these days given the cost, you can actually not only get this ability to lose a hard drive and keep on plugging away, you can also expand your storage to give the illusion that multiple hard drives are just one really big hard drive. Now how does this work? Well I'm really going to simplify things. So I don't want to write out one trillion bits here or one trillion bytes. So I'm going to instead say that suppose this hard drive here has just 4 bits in it. So I've only stored a tiny little file. I've stored the number 15 and this hard drive simply stores this value, another 4 bit value. What's really neat about RAID 5 is that essentially what it does it is uses the third hard drive simply to store the XOR of the first two hard drives. So using this very simple primitive can you do the following. 1XOR1 is going to give me if I line these left most bits up, 1XOR1 is what? 0, 1XOR1 is 0, 1XOR0, 1 and then 1. So this third hard drive is now essentially my check sum. So what does this mean exactly? Well let's suppose some bad things happened like this guy gets deleted or breaks. So now I've lost my data, right. This is really bad because some of my MP3s are over here, some of them just so happen to be on this hard drive. I seem to have lost it and frankly this is not a back up right? This third hard drive was clearly not the same as the one I lost. It happens coincidently to be the opposite. But what does this mean? Well thanks to RAID 5, XOR can also be applied in the, a reverse direction conceptually. If I now and let me go ahead and do something like can I change colors? Let's say that this guy broke entirely. So red is bad. So what can I do? Well it turns out if I XOR this guy with this guy, what do I get back out? So let's call this result. And now let's see. So look just at the black numbers. 1XOR0 gives me 1, 1XOR0 gives me 1, 1XOR1 gives me 0, 1XOR1 gives me 0. Wala, thanks god for that third hard drive that really wasn't doing anything in terms of space. It wasn't giving me another terabyte. It was just keeping not a backup, but the result of XORing the first two drives together. Can my computer if designed to support RAID recover the data that was lost in red? Let's suppose it was the opposite situation. So the take away to be clear is that this result equals the lost hard drive. So what is instead it's not the hard drive number 2 that is lost, but instead let's make him ok, back to black there. And now let's say it's this guy that actually disappears. So now he is the dead hard drive. Thankfully I've kept around my third hard drive, so let's see what happens here. 1XOR0 gives me 1, same thing gives me 1, same thing gives me 1, 1. Wala, I've recovered my data as well. And what's really neat is RAID 5 doesn't just support 3 hard drives, it supports 4 hard drives, 5 hard drives. You simply use the nth disk as this XOR disk, the check sum disk and can you recover all of your data in this very simple bit wise way. I'm going to take a 5 minute break.
[ Background noise ] Alright we are back, another application of XOR is actually this sort of mind blowing example where you can actually swap two values A and B without using a temporary variable. So alright, we've used this exercise multiple times to demonstrate how not to swap values, how to swap values, how to swap student's houses, so we kind of belabored this notion of swapping, but in every case, quiz 0 and before in lecture did we actually use some temporary storage because just conceptually it's kind of hard to imagine putting this variable here and this variable here without actually clobbering this one in between those two steps. But it turns out using XOR you can do this. I don't think it's all that enlightening to belabor the details, but this is a very quick and dirt main routine that notice declares two variables, X and Y. And this is called Swap2.c for those following along on paper. X is 1, Y is 2. I print out the values of those variables then I call this function swap, passing by reference or pointer, not passing by value because value is very bad. It's not going to work here. But then this program does actually work correctly. Let me go ahead and make swap2 and let me go ahead and run swap2 a the command line. And indeed it works. The interesting question is how does it work? Well notice we have these three lines of code, none of which actually employs a local variable. So I'll leave this I think more as an at home thought exercise, but don't be distracted by the use of pointers. The fact that we're using pointers is irrelevant to the trick itself. The pointers just allow us to actually do swaps at all. So we're using pointers, allows us to actually exchange the original values. So don't think that the magic lies in the stars, but the magic really lies in this carrot, in this XOR operator. So simply by XORing two values together once, and then again and then again while updating the values of A, B and A accordingly in each of those steps, you can literally like take 2 values and I'll do it dramatically with fingers this time, move them from one to the other without losing any data. And it's a neat trick and for those curious you can follow along at home with the slides for today where I belabor the point, line by line by line showing you in comment exactly what's going on underneath the hood, using very simple values of 1 and 4. But it's really neat. And if you're ever asked in some you know silly interview question years from now how can you swap two values without using a temporary variable? This is the answer Microsoft and the like are looking for, using bit wise operators. So it's neat if nothing else. But let's take a look at something that is more enlightening and perhaps more useful longer term. And that is to actually use a notion of a mask. So this is a common idea. And actually let's use these bit wise operations at a lower level, but let's see what we can do. So I'm going to do this from scratch even though you have a copy in front of you for reference, least I screw up on the fly. But I'm just going to go ahead and start, the goal here is to ask the user for an integer and then I want to convert that integer which the user is going to type in decimal. I want it to display as binary. Right I'm really curious to see what binary numbers look like. I want to see what the mapping is like. So if I run this program and I type in the number 1 in decimal, I want to see 31 zeros get printed and then 11 or if I type in 4 billion something or other, I want to see a whole lot of one's be printed across the screen. Alright, so how can we do this? And we'll see what the take away is in a moment. So let me go ahead and include my own library. Let me go ahead and also include let's say standardIO.h. Here's my main into main into arc.c star arg v bracket. Ok, coming along. So now what do I want to do? Well let's go ahead and ask the user, ask user for nt. Alright, so let's see, I'm going to need to store this somewhere and I'm going to need to do the following while the user doesn't cooperate, so I know this construct is generally useful, I'll fill in the blank in a moment. What do I want to do? Well let's do n get nt. Alright and now how do I check that this is actually, actually let's tell the user what to do. So print def give me a non negative nt. Alright, so now while and what do I want to put here to ensure that they're giving me a value I want?
[ Period of silence ] While n is? Less than 0. Alright so this will just pester the user again and again, give me a non negative nt, non negative nt, non negative nt until finally they give me 0 or higher, then this loop will break. And so now I can do something interesting with it. So now I want to print the number in binary. Alright, well this is where things get a little more interesting. Let's at least do a sanity check here. Let's print out the number in integer format, n. And then we'll just return. So let me do, make sure I didn't screw up anywhere else. Ok, so far so good, binary, give me a non negative nt, 1, ok seems to work. I'll give it 15. Ok, seems to work, alright. So let's move on now. Alright print number in binary. So how do I print a number in binary? So there's no, unlike JAVA there's not just a function that says print this in binary. There's most anything in languages like that. So C we have to kind of get our hands dirty or go lower level, but let's see I know how a number is represented underneath the hood. It's just an nt, let's see it's 32 bits. So I essentially just need a loop that's going to iterate 32 times conceptually from the left end of my number all the way to my right end, because print def is only going to print from left to right. I can't go backwards as we usually do on the screen. So I need a loop that's going to execute 32 times. So let me start off instinctively like this. Nt is 0, I is less than 32, I plus plus. So that's the right idea. I can maybe clean this up a little bit. In fact, I kind of want to do it from the left hand side, but we'll come back to that. I'll clean up that loop, but I at least have the basic building block in place. Now what do I want to do? Well I essentially want to ask this question on every iteration. Is the current bit a 1? If so print a 1. Is the current bit a 0? Print a 0. Because we're now talking about individual bits. But what I'm going to print to the screen is actually quote unquote 0 or quote unquote 1. So I need to essentially convert these tiny little bits to an actually ASCII character so the user can understand. So you know what I'm going to do? I'm actually going to scroll back because I want to do this from the left hand side first of the number, not the right hand side. So really I'm going to do this. And I'm going to be a little more dynamic. So nt I gets size of nt, oops, size of nt. Alright, so that's going to give me a 4 bytes but I want 8 bytes. Sorry, 8 bits per byte so size of nt times 8 gives me what number? Quick sanity check. 32 hopefully. On the 32 bit machine, this is going to give me the value of 32 because it's 4 times 8. But I don't want to go to the 32 bit because really if the bits are 0 indexed it's 31. So I'm going to be a little defensive and I'm going to start at 31. So the left most bit is going to be bit 31 because the right most bit is going to be bit number 0. So I want to do this so long as I is greater than or equal to 0. And then I want to decrement I on each iteration. So again the goal is if I have a really long 32 bit value, I conceptually just need I to point at the bit 31 then decrement to 30, 29, 28, dot, dot, dot. To bit 0 because what I want to do on each iteration to be clear is print out in ASCII form that particular bit, 0 or 1. So the question then becomes if I just have a chunk of 4 bits called an nt, how do I actually get at individual integers? Thus far we've not actually given you the ability. And you've probably not discovered the ability to actually manipulate individual bits. The smallest unit of data that you've probably been able to manipulate thus far is what? How many bytes or how many bits?
[ Period of silence ] Like what's the smallest piece of data you've actually changed or read in any program you've written thus far?
[ Background noise ] A char right? 8 bits or 1 byte is a char, but we want to go deeper than that. And there's no data type called bit, so we actually have to accept the fact that we have some limitations and figure out how inside of a char or inside of an nt we can nonetheless get at something we care about. And this common trick, this is just a piece of jargon, but it's an idea that will recur in programs perhaps throughout your life if you continue playing with stuff like this. We can define what's called a bit mask. So a mask is just a value that usually mostly zeros, but it has a 1 any location that you happen to care about at the current moment in time. And you can actually flip the definition and say it's all zeros, it's all ones except for occasional zeros. But I'm going to think of this mask, kind of in Photoshop form, in overlay where I want a whole bunch of zeros but I want a 1 at the location I currently care about. So in short, on every iteration now my goal I've decided is to give, is to create for myself a 32 bit value, all zeros but I want to turn on the bit I care about. Bit 31 followed by bit 30, followed by bit 30, 29, all the way down. And then on each iteration do I want to change all the 1s I've created back to 0s. So I want a big sequence of 0s and then just a 1 that moves from left to right on each iteration. So how do I achieve this? Well using these primitives from today. If I want a mask, if I want a 1 at the ith location, it turns out I can just do this. So I'm going to declare a mask to be 32 bits. That's why I'm using an nt. I need a 1 at a certain location. Well what location do I want it at at first? I want it at location 31. So using left shift, what this means is when I simply do 1 left shift I that's really like saying on the first iteration 1 left shift 31, which means ok, take a 1 and then shift it to the left 31 places, which essentially means, 1, 2, 3, 4, 5, 6, 7, 8, let's just make this look like an actual nt. And I'll shrink it so it actually fits on the screen. And that gives us this, get rid of our stupid color wheel. So if I have a 1 as I do at this point in time and then I shift it 31 places, that means my 1 goes 1, 2, 3, 4, dot, dot, dot. It eventually ends up over here. So my mask is now this integer of all zeros but a 1 bit at the location I care about. And now using another bit wise operator can I actually check a specific bit in the number that I got from the user. So let's take a look. Let me go ahead now and do if n, the number the user gave me, ended with bitwise ended with my mask is true, if that gives me back a positive value and rather a non 0 value, what do I want to print? In other words if the user's number ended with my current mask gives me a non 0 value, what does that mean? So I want to print a 1 because it means a 1 is there, else and I'll rewind in a second, I want to print a 0. Now why does this work? Well let's take a look at this. The mask I've created is in fact this. So now the user, the very, let's suppose the user types in 15. Let me use the same number of bits. So the user is going to type in 15 as I did in my second demo there, 1, 2, 3, 4. So this is the number 15 that the user has typed in. Sure more sophisticated tools exist than text edit for lectures, but that's ok. And this is my mask. And got to work on my font size. Alright, so this is my mask and I'll space things accordingly. Alright, this is a nice state of the world. N is that value 15 that the user typed in. My mask initially is a 1 all the way over at the left hand location. And then I do this and operator so what does that actually give me at this location? 0 and 1 is 0 and then how about everything else? If you just fast forward to the end. It's all zeros. So this is interesting because if I take the user's number N and and it with my current mask and my mask current has just a single bit set at the 31 location. And the answer comes back as zero. Well what's the take away? That that one location all the way on the left was a 0. Because if it were a one, what number would I have gotten back? If the user had typed in something like 4 billion, something atrocious and then I had anded those two values together, what then would I get as one and one? So I'd actually get a value like god a 1 here and then yes, a whole lot of zeros. And I don't know what this number is off hand. This is like 4 billion, 2 billion something or other right now. But what's the take away? 2 billion is non 0 which is why my check here if N anded with the mask gives me anything other than 0, if it gives me any 1 bits anywhere, ie a non 0 number, then I found a one at that location. Otherwise it finds only a zero and so I'll print that. So that's all it actually takes. I now have a loop that we've already concluded is going to iterate 32 times. I'm creating on each iteration a temporary mask, a temporary variable that just is all zeros except a 1 bit at interesting locations from left to right, 1 bit at a time. And the goal of that is to leverage the reality that if you and a number with a mask, the only bits that are going to be allowed to pass through this filter of sorts, through this mask is going to be a one bit in the original value. And that's kind of where the idea, the word gets its name. So if I have a mask with a 1 here, the only bits that will be allowed to pass through to be clear are where there are ones in the mask. So this number passes through. These numbers all get blocked by the and operation but that passing through of the left most one is enough o make me realize, oh this expression N and mask is true because remember true does not mean 1. True means non 0, false means 0, true means non 0, more generally. So let's actually now compile this and if I've made no mistakes, fingers crossed, ok mistakes, oh I called it print instead of print def. That's easy to fix. Alright. Let's go ahead and recompile. Better, so let me run binary in my current working directory, give me a non negative nt, let's give it 1. Ok, I need kind of a print def there, so let me go back in here and do a little print def of backslash N. And now let's recompile this. Just for aesthetic reasons. Ok. Now I'll give it 1. Ok so that's kind of cool. Underwhelming, let's give it something more interesting, 15. Ok, so now again all I've done is iterate 32 times left to right over that underlying 15 to get this result and let's do something like 2 billion. Enter, that's apparently what 2 billion in decimal terms actually equals in binary. So it's kind of neat. And this idea, this is a problem you will not often solve simply printing out numbers in binary. But this idea of selecting individual bits is actually a really powerful one. In some of your problems, that's like problem set 4, how many of you guys used an array, maybe a 1D or a 2 dimensional array of boles simply to remember which of the numbers had come with the bore. Anyone take this approach? Yeah, so from just chatter on the bulletin board, this seemed to be a not uncommon approach. Recall that p set 4 required that you remembered which numbers came with the board and which ones did not. So that you could actually prevent the user from changing some of them. Well, you guys, it sounds used a 2, some of you used a 2 dimensional array storing a whole bunch of Booleans so what is that? 81 Booleans to keep track of true or false. That was really a waste of space, right? You used 8 bits probably or you used 8 bits because a bole, if you ever really looked closely is represented with what data type? I think it's actually represented with what's called an unsigned char, but either way. The smallest data type you said we have in C is a char. Which means a bole underneath the hood is actually 8 bits. You were using 7 bits more than you needed to for every one of those true false values. So a very common approach to these low level bitwise operations is actually to compress your program into using only as much memory as it needs to. Because right intuitively a much clever, more clever approach would have been yes to have the notion of a 2D array, but just use a single bit for each of those locations. And you could have used one eighth the memory to remember something like which numbers came and did not come with the board. How do you actually implement this? Well you could actually now implement just a chunk of memory, if you think back to that problem set. You could actually implement yourself by manipulating individual bits. And so what you'll find in this world of bitwise operations, you can implement a function called set. And you can implement a function called get that simply takes a chunk of memory and flips individual bits on and off and get simply checks them. So we've looked at setting bits, sorry. We've used getting bits using the and operation, which of our several bitwise operations might be appropriate for actually setting a value? ANDOR, XOR, or compliment?
[ Background noise ] How might you turn on a bit?
[ Background noise ] Yeah, so or is actually the case. And how do you do that? Well if you wanted to OR two numbers together, what you could do is something like this. Let me get rid of my little scratch pad. If I had nt X gets let's say something like 0, and this actually equals 00000. And then I have, suppose I want to set the one bit, suppose I want to set this bit. What operation can I perform? Well I can actually do X GETS X OR 1 and that would actually set the left most bit. Why? Because if you line these two numbers up, 00000 and 00001, and you OR them together, well this guy here is obviously 0, 0 or 0, 0 or 0, 0 or 0, 0 or 1. So in fact the or operator has the effect of turning bits on, so long as you use something like assignment to keep that value around. So any questions on bitwise manipulation? Alright, so let's lay the foundation here for these higher level data structures that I promised. So we've used arrays, we talked about link lists, you'll use link lists, many of you for problem set 6. You'll find that you have some design discretion for that problem set. But suppose the problem at hand is to actually store data like words in a dictionary in some structure that facilitates answering questions like is this word here. A data structure that facilitates maybe even sorting this information faster than something like a link list could. So there's a problem with a linked list. If we implement, we're going to have you like for problem set 6, 140000 English words. And we're going to hand you a really big text file and you're going to be challenged with reading that file into memory using things like F read and other little functions we've seen or will see. And you've got to somehow represent all of those strings in memory because then we're going to make your code spell check actual files, really big files like the screenplay from Austin Powers and the King James the fifth bible and really large texts that will have some spelling mistakes or at least unrecognized words. And you're going to have to do this quickly. Unfortunately if you just load all 140000 words into an array or maybe being a little fancier a linked list, what's the running time of your lookup function going to be? What's the running time of your spell checking function going to be? I mean it's linear. Every time you might have to on average you're going to look at half of the word, so 70000 words every damn time we ask a question is this word spelled correctly. So can we do better? Well it turned out we can with any number of data structures, one of which is this thing called the hash table. And it turns out you'll see this for real in PHP and JAVA script, sort of the data structure that's known as the universal data structure, the one that some geeky computer scientist once joked that if you're on a desert island what data structure would you want to bring with you? Well you'd bring this thing called a hash table because you can do most anything with it. Whether sloppily or otherwise, but a hash table is simply this. It's essentially a big array and before you put something into this array you ask yourself the question at what location does this new element belong. In other words given a word from a dictionary, you ask yourself here's a big array or here's a bunch of buckets. Which bucket should I put this word into? Alright, so there's a problem though with this approach in general. If you're treating a hash table as just a big array or a whole bunch of buckets and suppose I take in a word like apple. And you decide somewhat arbitrarily that all of the apple, all the words starting with A should probably go where intuitively? At the 0 location right, just because right? A words will start at the beginning. And then I get banana, where should I put that into these buckets? Put it at location table bracket 1 and then zoo would go at the very bottom of the table. But there's a problem. What if then another word from the dictionary is aardvark. Well where does that go? If you've already got apple inside this table?
[ Period of silence ] Well those are the only three words so far. Apple, banana and zoo. And now I insert aardvark. Unfortunately bracket 0, bracket 1 and bracket 25 are already taken using our very simple heuristic, so what's, we need to get the word in there. We need aardvark to be in the dictionary. Where can we put it?
[ Period of silence ] Now you only have, alright, so there are what 23 remaining answers. You have a 1 in 23 chance of getting this right. Where do you put aardvark?
[ Background noise ] So two location, sure. Right? In short, who cares, right? Put it somewhere and so there's this general approach called linear probing which means try to put it where you want to. So in this case I'm going to try to put it at the zero location because that's where the A word should go. Unfortunately I have what's called a collision which means I already have something in that location. So probe the buckets, probe the hash table, just step by step by step until you find an empty slot, then plot aardvark in there. So we probe bracket 0 and we find apple there. We probe the next location and we find banana, ah ha! There's no C words yet, let's steal the C words place and put aardvark at table bracket 2. And you continue in this way. So now you are going to bump up against a real world limit. How big can this dictionary obviously be? So 26 words and that's it. So we're kind of screwed with our 140000 word dictionary already. So hopefully there's a fundamentally better approach but there's a problem fundamentally with linear probing itself which is that you're really hurting your look up time. The amazing thing about hash tables is that in the best case lookup times for words, lookup times for elements are how fast? In other words if I ask you is apple a word, how many steps, how many units of time do you need to answer that question with this data structure?
[ Background noise ] One, right? Because I hand you apple, you check the first letter, oh A, let me check my 0th location. Oh, there's apple, constant time, right? One operation of time is required. What about banana, is banana a word? Check the B, oh brackets 1, yes, banana is a word. But now you've got this problem. What if I hand you aardvark? Is aardvark a word? Well, you check its first letter and say damn, apple is there, but maybe I've got it elsewhere. Let me look. And that process of let me look now devolves into how many steps in the worse case? N, right? So we have this really interesting trade off. The motivation for hash tables if really cleverly designed is to achieve constant time lookups for pieces of information. And we have yet to discover this recently because we did have this with arrays. But even with arrays our search, our search time eventually became log N when we couldn't, when we had an arbitrary number of elements in here. So this kind of begs the question how big, how bad is this problem, right? One approach to this problem of having collisions between apple and aardvark is fine, yes this was just stupid using 26 buckets. Let's use 1000 buckets. And then with use a more sophisticated hash function then just checking the first letter, right? Slightly more interesting then checking the first letter might be why don't we check the first two letters? Well that helps us with aardvark and apple because we have aa and ap. So those would end up in different buckets. But I'm sure we could find another word that starts with AA or another word that starts with AP that's going to create another collision. So the problem is that even though this is sort of the holy grail of data structures, constant time operations because we can blow out of the water every other algorithm we've implemented thus far. We have this problem of collisions and we'll see that this problem will become in a room full of N CS50 students, what's the probability that 2 of you have the same birthday. Turns out that happens with strikingly high probability, which means in the analog of the world of apples and aardvarks, it's going to happen really frequently here too. And so we'll need a much more sophisticated approach than this, but for that we wait till Wednesday. See you then.
[ Music ]
==== Transcribed by Automatic Sync Technologies ====