Week 7 Wednesday

Andrew Sellergren

Announcements and Demos (0:00-10:00)

Now that you know a little bit about HTML, perhaps you can appreciate the humor of this tattoo.
Happy birthday, Nate!
Head to x.cs50.net/hello to be greeted by your CS50x classmates!
The scavenger hunt from Problem Set 4 is afoot! Together with your section, find as many of the computer scientists as you can and get pictures with them.
We're now accepting designs for CS50 apparel at cs50.net/design. The technical requirements are as follows:
- PNG
- 200+ DPI
- <= 4000 x 4000 pixels
- <= 10 MB
Have a campus problem that could be solved with a mobile app? Need a website for your student group? The Final Project is on the horizon! The specification is now available here. Know that the Pre-Proposal is just meant to get a conversation started between you and your TF. If you're struggling to come up with your own idea, check out projects.cs50.net for some ideas shared by others. Plenty of seminars and APIs are available to get you up and running in new technologies. APIs, or application programming interfaces, are libraries of code that empower you to interact with third-party applications and data. For example, the HarvardFood API is a service we wrote that scrapes the HUDS website to get daily menu information. If you employ this API, you can use this menu information in CSV, JSON, or serialized PHP format within your own code.

HTML and HTTP (10:00-55:00)

HTML stands for hypertext markup language. The word "markup" implies that it's not a programming language, per se. For the most part you can't (and shouldn't) express logic in HTML. Rather, HTML allows you to tell the browser how to render a webpage using start and end tags.
You can actually view the HTML of a webpage by right clicking and selecting View Source in most major browsers. In general, HTML has the following skeleton structure:
```
  <!DOCTYPE html>

  <html>
      <head>
      </head>
      <body>
      </body>
  </html>
```
The first line is the doctype declaration which tells the browser "here comes some HTML." Everything after that is enclosed in the <html> tag. This tag has two children, <head> and <body>. A very simple webpage we looked at last time was as follows:
```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          hello, world
      </body>
  </html>
```
If you copy this code into a text editor and save it as a .html file, you can open it in your browser and it will be displayed just like any other webpage. However, you'll be the only one that can view it since it's stored locally. If we want others to see it, we need to employ a web server. Turns out you already have one called httpd installed on the Appliance.
If you run the command ls in your home directory on the Appliance, you'll see a directory called vhosts. You can actually run multiple websites on a single web server using this concept of virtual hosts. When a browser sends a request to that web server, it will include in that request the domain name that it wants a response from. Within the vhosts folder, there's a single folder named localhost. On another web server, there might be multiple folders here, one for each website that the server hosts. Within the localhost folder is an html folder where we can store our HTML source code.
If we navigate to http://localhost on Chrome in our Appliance, we see a page that says "Index of /" and has nothing listed there. This is pointing to the html folder, so if we take the "hello, world" code from above and save it as hello.html in this directory, we will now see it listed. We can click on it to see the actual webpage displayed.
As an aside, we have to actually launch the web server before this will work. To do this (as the problem set specification details), we run the command sudo service httpd start.
Instead of displaying a list of files in the html directory, we'd rather have a default webpage displayed when we navigate to http://localhost. To do this, we can simply rename hello.html to index.html. Apache (the web server we're running) knows to render this file when a request is made for the root of a directory.
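The lookup described above can be sketched roughly in Python. This is only an illustration of the idea, not how Apache is actually implemented; the function name resolve and its vhosts_root parameter are made up for this example, and it ignores details such as rejecting paths that contain "..".

```python
import posixpath


def resolve(vhosts_root, host, path):
    """Map a Host header and request path to a file path (illustrative only)."""
    # Each virtual host gets its own folder, with an html folder inside it.
    docroot = posixpath.join(vhosts_root, host, "html")
    # A request for a directory (a path ending in "/") falls back to index.html.
    if path.endswith("/"):
        path += "index.html"
    return posixpath.join(docroot, path.lstrip("/"))


print(resolve("vhosts", "localhost", "/"))            # vhosts/localhost/html/index.html
print(resolve("vhosts", "localhost", "/hello.html"))  # vhosts/localhost/html/hello.html
```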

Let's start tweaking our HTML a little bit:

```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          This is CS50, Harvard College's...

          Prerequisites: none
      </body>
  </html>
```

When this page is rendered, the "Prerequisites: none" line will appear on the same line as "This is CS50, Harvard College's..." Recall that browsers collapse runs of whitespace in the HTML source, including newlines, into a single space, so a blank line in the source doesn't produce a blank line on the webpage. If we want a line break between these two lines, we have to manually insert one with the <br/> tag:
```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          This is CS50, Harvard College's...
          <br/>
          Prerequisites: none
      </body>
  </html>
```
<br/> is an empty tag, meaning that it is both an open and a close tag in one. That's why we write the slash before the right angle bracket.
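To see why the blank line in the earlier version had no visible effect, here is a small Python approximation of how a browser treats whitespace in ordinary text. It is a simplification (real browsers follow more detailed rules), and the variable names are just for illustration.

```python
source = """This is CS50, Harvard College's...

Prerequisites: none"""

# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces with single spaces approximates what the browser displays.
rendered = " ".join(source.split())
print(rendered)  # This is CS50, Harvard College's... Prerequisites: none
```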

Let's add some more text:

```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          This is CS50, Harvard College's...
          <br/>
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
          A quick brown fox jumps over a lazy dog.
      </body>
  </html>
```

Now that we have more text, we should probably start to think about how we might organize it into paragraphs. There's even a paragraph tag in HTML written as <p>:

```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          <p>
              This is CS50, Harvard College's...
          </p>
          <p>
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
          </p>
      </body>
  </html>
```

The <p> tag will actually insert linebreaks for us.

Suppose we want to link to another website's image of a quick brown fox. We can use the <a> tag to do this:

```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          <p>
              This is CS50, Harvard College's...
          </p>
          <p>
              A <a href="...">quick brown fox</a> jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
              A quick brown fox jumps over a lazy dog.
          </p>
      </body>
  </html>
```

Note that the dots inside the href attribute are a placeholder for the URL that we want to link to, omitted here for brevity.

Rather than link to the image, we can actually include it in our webpage using the <img> tag:

```
  <!DOCTYPE html>

  <html>
      <head>
          <title>hello, world</title>
      </head>
      <body>
          <p>
              This is CS50, Harvard College's...
          </p>
          <p>
              A quick brown fox jumps over a lazy dog.
          </p>
          <p>
              <img alt="quick brown fox" src="http://..."/>
          </p>
      </body>
  </html>
```

The alt text is a short description of the image for the benefit of the visually impaired and anyone who might not be able to download or view the image.

Now that we know HTML, we can write our own search engine! Well, not quite, but at least we can copy Google's homepage:

```
  <!--

  search0.html

  David J. Malan
  malan@harvard.edu

  Demonstrates form submission.

  -->

  <!DOCTYPE html>

  <html>
      <head>
          <title>CS50 Search</title>
      </head>
      <body>
          <h1>CS50 Search</h1>
          <form action="http://www.google.com/search" method="get">
              <input name="q" type="text"/>
              <br/>
              <input type="submit" value="CS50 Search"/>
          </form>
      </body>
  </html>
```

Comments in HTML are enclosed by <!-- and -->. The <h1> tag is a heading tag which by default specifies big and bold text. There are also <h2>, <h3>, <h4>, <h5>, and <h6> tags which specify progressively smaller text.
To take input from the user, we use the <form> tag. The action attribute specifies the URL that will handle the user's input once it's submitted. As you can see, we're cheating a little bit here by passing the buck to Google. The two most common values for the method attribute are "get" and "post." Here, we're using "get." More on this later.
The <input> tag with type attribute specified as "text" creates a textbox input for our form. The <input> tag with the type attribute specified as "submit" creates a submit button with the label set according to the value attribute.
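To make the roles of these attributes concrete, the following Python sketch reads the same form with the standard library's html.parser module and pulls out the pieces a browser needs in order to submit it. The class name FormInspector is made up for this example, and a real browser does far more than this.

```python
from html.parser import HTMLParser

FORM = """
<form action="http://www.google.com/search" method="get">
    <input name="q" type="text"/>
    <br/>
    <input type="submit" value="CS50 Search"/>
</form>
"""


class FormInspector(HTMLParser):
    """Collects the form's attributes and the attributes of each <input>."""

    def __init__(self):
        super().__init__()
        self.form = {}
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "form":
            self.form = dict(attrs)
        elif tag == "input":
            self.inputs.append(dict(attrs))


inspector = FormInspector()
inspector.feed(FORM)
print(inspector.form)    # where to send the input (action) and how (method)
print(inspector.inputs)  # one dictionary of attributes per <input> tag
```

Only the first input has a name attribute, which is why q is the only parameter that ends up being sent when the form is submitted.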
When we navigate to http://localhost/search0.html, we get a Forbidden error. By default, our filesystem assumes that we don't want our files readable by everyone. In the case of a webpage, we probably do want it to be readable by everyone. To see who has permission to read a file, we run the ls -l command. On the far left, we'll see sequences of dashes that look something like this:
```
  -rw-------
```
We can separate these out into a single dash followed by three groups of three:
```
  - rw- --- ---
```
The first dash is replaced with a d if this item is a directory. The three groups of three characters define the read (r), write (w), and execute (x) permissions, in that order, of the owner, the group, and the world, respectively. The fact that the third group for the file above is --- means that the world has no permissions on this file. If we want to be able to view this file as a webpage, we need to give the world read permissions. To do this, we run the following command:
```
  chmod a+r search0.html
```
This means add the r permissions for a, or all, on the file search0.html.
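The same change can be illustrated with Python's stat module, which can turn a numeric file mode into the string that ls -l displays. This sketch works on plain numbers and does not touch any file; the variable names before and after are just for illustration.

```python
import stat

# A regular file (S_IFREG) that only its owner can read and write.
before = stat.S_IFREG | 0o600
print(stat.filemode(before))  # -rw-------

# chmod a+r: turn on the read bit for the owner, the group, and the world.
after = before | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
print(stat.filemode(after))   # -rw-r--r--
```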
Now, when we navigate to http://localhost/search0.html, we see an ugly looking search engine. If we type "quick brown fox" and hit Enter, we are whisked away to a search results page provided by Google. How did this work? If we peek at the address bar, we see that the URL we've landed on is the following:
```
  http://www.google.com/search?q=quick+brown+fox
```
It appears that the "get" method actually takes our input and appends it to the URL specified in the action attribute. The ? signifies that a series of key-value pairs are about to be provided. These key-value pairs are parameters that the user provided in the form. In this case, we named our parameter q for query, so we have q=quick+brown+fox. The + denotes a space.
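As a small sketch of what the browser does when it submits a "get" form, Python's urllib.parse.urlencode builds the same kind of query string from key-value pairs. The second example, with an ampersand in the input, is our own addition to show that characters with special meaning in a URL get escaped.

```python
from urllib.parse import urlencode

action = "http://www.google.com/search"
params = {"q": "quick brown fox"}

# Append "?" and the encoded key-value pairs to the form's action URL.
url = action + "?" + urlencode(params)
print(url)  # http://www.google.com/search?q=quick+brown+fox

# Characters like "&" separate parameters, so they are escaped inside a value.
print(urlencode({"q": "cats & dogs"}))  # q=cats+%26+dogs
```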
Under the hood, our request to render search0.html looks something like this:
```
  GET /search0.html HTTP/1.1
```
To view the HTTP traffic, we can click on the Network tab in Chrome's (or IE's or Firefox's) Developer Tools. There we can see that the server's response begins with the following:
```
  HTTP/1.1 200 OK
```
You may never have seen the number 200 as a response code, but chances are you've seen the numbers 404 as "not found," 403 as "forbidden," and 500 as "server error" response codes. The number 304 means "not modified," which implies that the browser can serve up a cached version of the page rather than asking the server to retransmit data that hasn't changed.
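For reference, Python's standard library knows the standard name of each of these codes. The sketch below looks them up and also splits a status line like the one above into its parts; the helper parse_status_line is a made-up name for this example.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split a line like 'HTTP/1.1 200 OK' into version, code, and reason."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


print(parse_status_line("HTTP/1.1 200 OK"))  # ('HTTP/1.1', 200, 'OK')

# Print the standard reason phrase for each code mentioned above.
for code in (200, 304, 403, 404, 500):
    print(code, HTTPStatus(code).phrase)
```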
If we keep the Network tab open when we click the CS50 Search button, we see a whole bunch of files are downloaded. Looking at just the first, we see that the request headers start with the following:
```
  GET /search?q=quick+brown+fox HTTP/1.1
  Host: www.google.com
```
Interestingly, Chrome will also show the query string parameters which it parsed from the URL.
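The parsing that Chrome performs here can be approximated in a few lines of Python. This is a simplified sketch that handles only the two header lines shown above, not a complete HTTP parser, and the variable names are our own.

```python
from urllib.parse import urlsplit, parse_qs

raw = "GET /search?q=quick+brown+fox HTTP/1.1\r\nHost: www.google.com\r\n\r\n"

# The first line is the request line; the lines after it are headers.
request_line, *header_lines = raw.split("\r\n")
method, target, version = request_line.split(" ")

headers = {}
for line in header_lines:
    if not line:  # a blank line marks the end of the headers
        break
    name, value = line.split(":", 1)
    headers[name.strip().lower()] = value.strip()

# Everything after the "?" in the target is the query string.
params = parse_qs(urlsplit(target).query)

print(method, headers["host"])  # GET www.google.com
print(params)                   # {'q': ['quick brown fox']}
```

Note that parse_qs returns a list for each key, since a key may legally appear more than once in a query string.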
Thus far, we've taken it for granted that we can type in a URL into our browser and get back data from halfway across the country or the world. How is this working? We can run a command traceroute to actually see the path that our request takes. We know that Stanford is across the country, so let's run traceroute www.stanford.edu. Each of the lines in the output represents a router that our request went through. The first few are Harvard's routers, but by step 6, we see that we're in Kansas already! On the right side, there are three time values which represent three measurements of the number of milliseconds it took to reach this router. Next to the Kansas router, the values are around 60 milliseconds. Pretty cool how fast we can get to Kansas and back! Lines that are just three asterisks represent routers that ignore this type of request, so we don't know where they are. If we run traceroute www.cam.ac.uk for University of Cambridge's website, we can see there's a big gap in time between two of the steps where our request is traversing the Atlantic Ocean.
To summarize what we've learned thus far, HTML is a language that allows us to indicate how we want a browser to render our webpage. To get the HTML from other websites, we make a request via HTTP, the protocol for communicating between client and server. The server responds with a numeric code indicating whether everything is OK (e.g. 200) or if some error occurred (e.g. 403, 404, 500).

TCP/IP (55:00-74:00)

HTTP actually sits on top of TCP/IP, the protocol which dictates how packets of information travel from client to server. We saw this protocol in action when we ran traceroute to watch our request hop from router to router. TCP/IP dictates that every computer be identifiable by an IP address of the form w.x.y.z, where w, x, y, and z are numbers from 0 to 255. Each of these four numbers fits in 8 bits, so an IP address is 32 bits long, which means there are 2^32, or roughly 4 billion, possible values. Believe it or not, we're actually running out of IP addresses. As a result, there's been a push to move toward IPv6, which uses 128 bits for addresses.
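The arithmetic behind the address space can be checked in Python. The helper to_int is a made-up name for this sketch; it packs the four numbers of a dotted address into one 32-bit integer, which we compare against the standard ipaddress module.

```python
import ipaddress


def to_int(dotted):
    """Pack w.x.y.z into a single 32-bit number, 8 bits per part."""
    w, x, y, z = (int(part) for part in dotted.split("."))
    return (w << 24) | (x << 16) | (y << 8) | z


print(to_int("127.0.0.1"))                              # 2130706433
print(int(ipaddress.IPv4Address("255.255.255.255")))    # 4294967295
print(2 ** 32)                                          # 4294967296 possible IPv4 addresses
print(2 ** 128)                                         # possible IPv6 addresses
```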
An IP address is actually not enough to handle communication between client and server. Both client and server also have specific ports to handle different types of traffic. Port 80 is the default for HTTP. There's also port 443 for HTTPS, 25 for SMTP (an e-mail protocol), and 22 for SSH. Your browser normally hides this port information from you, but you can verify that it's using it by navigating to http://www.facebook.com:80. Notice that the 80 just disappears and you end up on Facebook.
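As a rough sketch of how a client decides which port to connect to, the Python below uses an explicit port if the URL has one and otherwise falls back to the default for the scheme. The function name effective_port and the small DEFAULT_PORTS table are our own for this illustration.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes a browser handles most often.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in the URL, or the scheme's default."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS[parts.scheme]


print(effective_port("http://www.facebook.com:80"))  # 80
print(effective_port("http://www.facebook.com"))     # 80
print(effective_port("https://www.facebook.com"))    # 443
```

Since the first two URLs lead to the same port, the browser can drop the :80 from the address bar without changing where the request goes.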
If an IP address is necessary to identify a server, why don't we have to type it into our browser to get to www.cnn.com? Our browser does this lookup for us. We can do it manually by running the command nslookup www.cnn.com. Interestingly, you can take the IP address that this command returns and navigate to it in your browser to see CNN's homepage! The process of translating a domain name like www.cnn.com into an IP address is known as a DNS lookup. DNS stands for domain name system. There are special servers called DNS servers that respond to these lookups.
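A program can ask for the same lookup through the operating system. The sketch below wraps Python's socket.gethostbyname in a function we've called lookup (a made-up name); it needs network access, and the address returned for a given name can vary by time and location, so no expected output is shown.

```python
import socket


def lookup(hostname):
    """Return an IPv4 address for hostname, using the system's resolver."""
    return socket.gethostbyname(hostname)


# Example call (requires network access):
#   lookup("www.cnn.com")
```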
For a fun, high-level look at how the internet works, check out Warriors of the Net.