Andrew Sellergren
You can view the HTML of a webpage by right-clicking and selecting View Source in most major browsers. In general, HTML has the following skeleton structure:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
</body>
</html>
The first line is the doctype declaration, which tells the browser "here comes some HTML." Everything after that is enclosed in the <html> tag. This tag has two children, <head> and <body>. A very simple webpage we looked at last time was as follows:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
hello, world
</body>
</html>
If you save this code as a .html file, you can open it in your browser and it will be displayed just like any other webpage. However, you'll be the only one who can view it, since it's stored locally. If we want others to see it, we need to employ a web server. It turns out you already have one, called httpd, installed on the Appliance.
If you run ls in your home directory on the Appliance, you'll see a directory called vhosts. You can run multiple websites on a single web server using this concept of virtual hosts. When a browser sends a request to that web server, it includes in the request the domain name that it wants a response from. Within the vhosts folder, there's a single folder named localhost. On another web server, there might be multiple folders here, one for each website that the server hosts. Within the localhost folder is an html folder where we can store our HTML source code.
When we navigate to http://localhost, we see a listing of the contents of the html folder, so if we take the "hello, world" code from above and save it as hello.html in this directory, we will now see it listed. We can click on it to see the actual webpage displayed. If the web server isn't running, we can start it with sudo service httpd start.
Rather than see a listing of the html directory, we'd rather have a default webpage displayed when we navigate to http://localhost. To do this, we can simply rename hello.html to index.html. Apache (the web server we're running) knows to serve this file when a request is made for the root of a directory.
Let's start tweaking our HTML a little bit:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
This is CS50, Harvard College's...
Prerequisites: none
</body>
</html>
When this page is rendered, the "Prerequisites: none" line will appear on the same line as "This is CS50, Harvard College's..." Recall that whitespace in the HTML source code doesn't always translate to whitespace on the webpage itself. If we want a linebreak between these two lines, we have to manually insert one with the <br/> tag:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
This is CS50, Harvard College's...
<br/>
Prerequisites: none
</body>
</html>
<br/> is an empty tag, meaning that it is both an open and a close tag in one. That's why we write the slash before the right angle bracket.
Let's add some more text:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
This is CS50, Harvard College's...
<br/>
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
</body>
</html>
Now that we have more text, we should probably start to think about how we might organize it into paragraphs. There's even a paragraph tag in HTML, written as <p>:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<p>
This is CS50, Harvard College's...
</p>
<p>
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
</p>
</body>
</html>
The <p> tag will actually insert linebreaks for us.
Suppose we want to link to another website's image of a quick brown fox. We can use the <a> tag to do this:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<p>
This is CS50, Harvard College's...
</p>
<p>
A <a href="...">quick brown fox</a> jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
A quick brown fox jumps over a lazy dog.
</p>
</body>
</html>
The ellipses in the href attribute are a placeholder for the URL that we want to link to, omitted here for brevity.
Rather than link to the image, we can actually include it in our webpage using the <img> tag:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<p>
This is CS50, Harvard College's...
</p>
<p>
A quick brown fox jumps over a lazy dog.
</p>
<p>
<img alt="quick brown fox" src="http://..."/>
</p>
</body>
</html>
The alt text is a short description of the image for the benefit of the visually impaired and anyone who might not be able to download or view the image.
Now that we know HTML, we can write our own search engine! Well, not quite, but at least we can copy Google's homepage:
<!--
search0.html
David J. Malan
malan@harvard.edu
Demonstrates form submission.
-->
<!DOCTYPE html>
<html>
<head>
<title>CS50 Search</title>
</head>
<body>
<h1>CS50 Search</h1>
<form action="http://www.google.com/search" method="get">
<input name="q" type="text"/>
<br/>
<input type="submit" value="CS50 Search"/>
</form>
</body>
</html>
Comments in HTML are enclosed between <!-- and -->. The <h1> tag is a heading tag which by default renders big, bold text. There are also <h2>, <h3>, <h4>, <h5>, and <h6> tags which render progressively smaller text.
The form itself is created with the <form> tag. The action attribute specifies the URL that will handle the user's input once it's submitted. As you can see, we're cheating a little bit here by passing the buck to Google. The two most common values for the method attribute are "get" and "post." Here, we're using "get." More on this later.
The <input> tag with the type attribute specified as "text" creates a textbox input for our form. The <input> tag with the type attribute specified as "submit" creates a submit button with the label set according to the value attribute.
When we navigate to http://localhost/search0.html, we get a Forbidden error. By default, our filesystem assumes that we don't want our files readable by everyone. In the case of a webpage, we probably do want it to be readable by everyone. To see who has permission to read a file, we run the ls -l command. On the far left, we'll see sequences of dashes that look something like this:
-rw-------
We can separate these out into a single dash followed by three groups of three:
- rw- --- ---
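To make the grouping concrete, here is a minimal Python sketch that splits a mode string like the one above into its leading character and its three groups of three. The function names split_mode and can_read are made up for this illustration; they are not part of ls or of any library. The sketch assumes the usual convention that the first character of each group is r when reading is allowed.

```python
def split_mode(mode):
    """Split a 10-character ls -l mode string into its parts."""
    if len(mode) != 10:
        raise ValueError("expected 10 characters, e.g. '-rw-------'")
    file_type = mode[0]  # 'd' for a directory, '-' for a regular file
    owner, group, world = mode[1:4], mode[4:7], mode[7:10]
    return file_type, owner, group, world


def can_read(triple):
    """Return True if a three-character group grants read permission."""
    return triple[0] == "r"


file_type, owner, group, world = split_mode("-rw-------")
print(file_type, owner, group, world)    # - rw- --- ---
print(can_read(owner), can_read(world))  # True False
```

For the file above, the owner can read it but the world cannot, which is consistent with the Forbidden error.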
The first dash is filled in with a d if this item is a directory. The three groups of three dashes define the read, write, and execute permissions of the owner, the group, and the world, respectively. The fact that the third group for the file above is --- means that the world has no permissions on this file. If we want to be able to view this file as a webpage, we need to give the world read permissions. To do this, we run the following command:
chmod a+r search0.html
This grants r (read) permissions to a, or all, on the file search0.html.
Now, when we navigate to http://localhost/search0.html, we see an ugly-looking search engine. If we type "quick brown fox" and hit Enter, we are whisked away to a search results page provided by Google. How did this work? If we peek at the address bar, we see that the URL we've landed on is the following:
http://www.google.com/search?q=quick+brown+fox
The part before the ? is the URL we specified in the form's action attribute. The ? signifies that a series of key-value pairs are about to be provided. These key-value pairs are parameters that the user provided in the form. In this case, we named our parameter q for query, so we have q=quick+brown+fox. The + denotes a space.
Under the hood, our request to render search0.html looks something like this:
GET /search0.html HTTP/1.1
To view the HTTP, we can click on the Network tab in Chrome's (or IE's or Firefox's) Developer Tools. We then can see that the response begins with the following:
HTTP/1.1 200 OK
If we keep the Network tab open when we click the CS50 Search button, we see a whole bunch of files are downloaded. Looking at just the first, we see that the request headers start with the following:
GET /search?q=quick+brown+fox HTTP/1.1
Host: www.google.com
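To tie the form, the URL, and the request together, here is a rough Python sketch using only the standard library's urllib.parse module. The helper names build_url and request_head are hypothetical, and the request it prints is simplified to the two lines shown above; a real browser sends many more headers.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_url(action, params):
    """Append form fields to the action URL the way a "get" form does."""
    return action + "?" + urlencode(params)  # spaces become '+'


def request_head(url):
    """Return the first two lines of a simplified HTTP/1.1 GET request."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return [f"GET {target} HTTP/1.1", f"Host: {parts.netloc}"]


url = build_url("http://www.google.com/search", {"q": "quick brown fox"})
print(url)  # http://www.google.com/search?q=quick+brown+fox
for line in request_head(url):
    print(line)
print(parse_qs(urlsplit(url).query))  # {'q': ['quick brown fox']}
```

The last line goes the other way, decoding the key-value pairs much as a server receiving the request might.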
We can use a program called traceroute to actually see the path that our request takes. We know that Stanford is across the country, so let's run traceroute www.stanford.edu. Each of the lines in the output represents a router that our request went through. The first few are Harvard's routers, but by step 6, we see that we're in Kansas already! On the right side, there are three time values, which are three measurements of the round-trip time, in milliseconds, to this router. Next to the Kansas router, the values are around 60 milliseconds. Pretty cool how fast we can get to Kansas and back! Lines that are just three asterisks represent routers that ignore this type of request, so we don't know where they are. If we run traceroute www.cam.ac.uk for the University of Cambridge's website, we can see there's a big gap in time between two of the steps, where our request is traversing the Atlantic Ocean.
We just used traceroute to watch our request hop from router to router. TCP/IP dictates that every computer be identifiable by an IP address of the form w.x.y.z, where w, x, y, and z are numbers from 0 to 255. Each of those numbers fits in 8 bits, so IP addresses are 32 bits, which allows for about 4 billion possible values. Believe it or not, we're actually running out of IP addresses. As a result, there's been a push to move toward IPv6, which uses 128 bits for addresses.
To find the IP address for a domain name, we can run nslookup www.cnn.com. Interestingly, you can take the IP address that this command returns and navigate to it in your browser to see CNN's homepage! The process of translating a domain name like www.cnn.com into an IP address is known as a DNS lookup. DNS stands for domain name system. There are special servers called DNS servers that respond to these lookups.
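The arithmetic behind "32 bits" can be sketched in Python as follows. The function ip_to_int is a made-up helper that packs the four numbers of an address into one integer, 8 bits each. The lookup function is likewise hypothetical: it simply wraps the standard library's socket.gethostbyname, which asks the system's resolver for an address. It is defined but not called here because it needs network access.

```python
import socket


def ip_to_int(address):
    """Pack a dotted IPv4 address w.x.y.z into a single 32-bit integer."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(p < 0 or p > 255 for p in parts):
        raise ValueError("expected four numbers from 0 to 255")
    value = 0
    for p in parts:
        value = (value << 8) | p  # each number fills 8 bits
    return value


def lookup(hostname):
    """Ask the system's resolver (and so DNS) for an IPv4 address."""
    return socket.gethostbyname(hostname)  # requires network access


print(ip_to_int("0.0.0.0"))          # 0
print(ip_to_int("255.255.255.255"))  # 4294967295
print(2 ** 32)                       # 4294967296
```

The largest address maps to 2^32 - 1, so there are 2^32 (roughly 4.3 billion) possible values in total.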