[MUSIC PLAYING]

ROBERT KRABEK: Hello, guys. My name is Robert Krabek, and I will be teaching you how to scrape the web with Nokogiri, which is a Ruby library, and Kimono, which is a Chrome extension.

So first, there are a couple of things you can do if, maybe, you've been doing all the psets so far and your workspace is getting a little full. We can actually just go and create a new workspace for you to do a brand-new project in. If you do want to continue working in the CS50-template IDE that you currently have, feel free, and you can just install Nokogiri with CFLAGS= gem install nokogiri. But otherwise I'll show you how to set a new one up. This is essentially dropping more of the training wheels, and you're coding as if you were just coding in Sublime or something. So let's shift over.

Say this is your current CS50 IDE. You can just go to Cloud9 here and go to your dashboard. It should bring up the Workspaces tab. Then you can click Create a New Workspace, name your new workspace-- maybe "test," or "scraping"-- and then click this Custom tab here, instead of the CS50 templates tab. Then you can just go and create the new workspace.

I've already created a workspace here, so we'll be working with this. And if you created a new workspace with the Custom tab, you can just type gem install nokogiri, which is not going here. OK, it's a little frozen. But you can type gem install nokogiri, and that should be all there is to the installation.

As I said before, if you're still working in your CS50-template IDE, you just need to type CFLAGS= gem install nokogiri. I've already installed it here, so I won't do that. But for those following along, feel free to do so.

So once you've got Nokogiri installed in your workspace, I'm going to give you a little bit of a crash course in Ruby syntax, because Nokogiri is a Ruby library, so you'll need to know some basic Ruby syntax for working with it.
So, some basic differences from what you're used to if you've been working so far in just C and PHP: you declare variables with no type. You don't use semicolons, which is kind of a relief. There are no parentheses around for or while loop conditions, for example; you just have a block of code, and then you put end at the end of it. There's no ++ or --, so know, for when you're writing loops, that it's just += and -=. And instead of #include, you use require and then whatever library you're trying to load into your program.

Ruby isn't a compiled language, so that's another relief. It's more similar to PHP in that it's an interpreted language. You can run any Ruby script that you write with ruby followed by the name of your script or program. To signify that it's a Ruby program, you just end it with .rb instead of .c. And there are variable-sized arrays in Ruby, which is super convenient when you're scraping and perhaps want to append data that you've scraped onto an array. You don't have to malloc a new array and copy the old array into the new one; you can just append with the two angle brackets, <<. And there are no chars; there are just single-letter strings. So that should be a little easier.

So we'll just give you some examples of basic Ruby syntax. Here you can see that instead of the slash slash, to comment in Ruby you just use the pound sign. For variable declaration, you just type the variable equals whatever you want the variable to be. They can be strings. You can have arrays, which you populate with values. puts and print are similar; for our purposes, the only real difference is that puts, which stands for put string, appends a newline character to whatever you're printing.

So if we give a small demonstration here, we can run this with-- open a new terminal. You can see all of the files that are in my terminal. And if I just run it with Ruby, ruby intro.rb, it puts out 5, Hello, Mather, Quincy, Currier, Adams. So that's all there is to declaring arrays.
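For reference, here is a minimal sketch of those basics; the file name and values are just illustrative, in the spirit of the intro.rb on screen:

    # intro.rb -- comments in Ruby use the pound sign
    x = 5                                              # no type, no semicolon
    houses = ["Mather", "Quincy", "Currier", "Adams"]  # an array literal
    houses << "Dunster"                                # append with the two angle brackets
    puts x                                             # puts adds a trailing newline
    print "Hello "                                     # print does not
    puts houses                                        # prints each element on its own line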
AUDIENCE: Robert, can you make your font a little bigger?

ROBERT KRABEK: Yes. And I can zoom in, because you can't zoom in on terminal fonts, apparently.

So that's how you print variables to your terminal. You can also use variables inside a string. Recently in PHP, you might have learned that there is string interpolation. So if you take a look here: I declare three variables-- name, library, and language-- and I puts, I write a string, "hello my name is". And then the Ruby version of string interpolation looks a little different from the PHP version: you have a pound sign, then curly braces, and then the name of the variable. That's how you'd print whatever the variable name holds.

And then you can also concatenate strings. Ruby makes it super easy with the plus sign: you just have one string on the left, plus a variable, or another string plus a string. So if I print this out, it should just say, "Hello, my name is Robert. I will be teaching you nokogiri in Ruby." And let's just confirm that that is indeed the case-- ruby intro. "Hello, my name is Robert. I will be teaching you nokogiri in Ruby."

Moving on: if/else statements. This is a little different from what you might be used to if you've been working in C. You don't need the parentheses. You don't need the curly braces. And instead of else if, it's a concatenated elsif. So in here, if I've declared x up here-- as we can see, x is still 5-- then if x is less than 3, it'll print small; if it's less than 7, medium; else, large. So 5 is a medium number. And I end this block of code with end.

Here is my for loop, and this syntax is also slightly different. The 0 to 5 is essentially declaring a range from 0 to 5, and then for each value in that range, i takes on that value and gets printed. So this should print 0 through 5, or 0 through 4 if the range excludes its endpoint. And this should print medium.

And I'll just blaze through. You guys will have access to this code later on, so you can run this yourselves.
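As a sketch of those constructs-- and to pin down the range question: two dots include the endpoint, three dots exclude it:

    name = "Robert"
    puts "Hello, my name is #{name}"   # string interpolation
    puts "Hello, my name is " + name   # concatenation with +

    x = 5
    if x < 3
      puts "small"
    elsif x < 7
      puts "medium"
    else
      puts "large"
    end

    for i in 0...5    # three dots: prints 0 through 4
      puts i          # 0..5 (two dots) would print 0 through 5
    end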
So this is your basic while loop. This will just be printing j, incrementing by 1 until we hit 5.

Next, a super quick Ruby crash course on how to write a function. Instead of, say, int factorial(int number), we just have def. Essentially, you're defining a function here: this is going to be the name of the function, and this is any variable that you want to pass into the function. You can have if statements within it. You can return. In this case, we're defining a recursively implemented factorial function. And we just call functions in Ruby like this: if I've defined this, I can call factorial, pass in 3, and then 3 will be the number variable that I can use within the function.

And this to_s is just turning the return value of factorial into a string. Otherwise this will throw an error saying, oh, I can't print this-- because, as you remember, puts is put string, and this factorial has returned a number that we're tacking onto a string. So we can convert it to a string like such. And conversely, you can also convert a string to an integer with to_i.
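A sketch along those lines-- the exact code on screen may differ slightly:

    def factorial(number)
      if number <= 1
        return 1
      else
        return number * factorial(number - 1)
      end
    end

    # to_s converts the returned number into a string so it can be concatenated
    puts "Factorial of 3 is " + factorial(3).to_s
    # and to_i goes the other way
    puts "6".to_i + 1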
So, making everything super simple, if I just comment this out, save, and run the factorial function, we should be able to see that the factorial of 3 is 6. And that is indeed true.

So that's your crash course in Ruby. And now that you know Ruby, we can go on to the basic Nokogiri scraping setup. Essentially, all you have to do in Ruby is require the libraries; for our purposes, we'll be using the OpenURI library as well as Nokogiri. And then what you do-- and I'll give you the syntax for this-- is you open the URL, much as you would with a cURL request (cURL stands for "C URL"). You take the URL of the website in question, you store it in a variable, and then you can search through that variable for unique HTML tags using the .css command. And then you can output the content wherever you want: you can store it in a database, output it to a file, or even just print it to the screen.

So we'll show you a basic scraper. Up here you can see we have require nokogiri, require open-uri. Your basic setup: let's call it document, or doc, equals Nokogiri::HTML open, where open is the command provided to us by the OpenURI library. And we'll be searching, for those of you who might be living in the quad, for bikes that are in Boston, listed on the Boston Craigslist bike section site.

If you are unfamiliar with cURL, I'll just show you real quick what cURL will do. If I wanted to get all of the HTML from the Craigslist site, and I type curl plus the URL, it just dumps all of the HTML from the Craigslist bicycle site onto my terminal. That's not particularly useful on its own, because I don't want to manually go through and find the thing I'm looking for. But just so you can see that I'm actually using the right code, if you look at the URL for Craigslist bikes-- for some reason it's not found. If you look at this page and you look at the URL, this should be identical to the cURL request that I just sent. And indeed, that's what's being stored in the doc variable.

So when we go back to our code, we can then operate on this doc variable using .css. Say I wanted to get all of the tags that are span.txt, and all the a tags within that tag. And why might we want to do this, I hear you cry? If we Inspect Element, it gives you a breakdown of how the page is structured. If I scroll down through here, you can see what each of these different elements represents. So maybe I want to access this particular element. I'm using Chrome developer tools to Inspect Element, and I can see down here that this is an a tag within a span tag with a class of txt.

So this gets to our first operation, which is doc.css span, where span is the tag that I'm looking for within all this HTML. Then .txt operates much like CSS does when you're writing CSS in your HTML files: it specifies a class. So this particular selector specifies a span tag with a class of txt. And then if I leave a space, this will go within that tag and find an a tag inside it. So if I just puts this to the terminal, I should be able to see essentially everything that is within this span of class txt.
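Put together, the basic scraper looks something like this. The URL is illustrative, and note that on newer Ruby versions OpenURI's open call is spelled URI.open:

    require 'nokogiri'
    require 'open-uri'

    # Fetch and parse the page, storing the result in doc
    doc = Nokogiri::HTML(open("https://boston.craigslist.org/search/bik"))

    # Every a tag inside a span with class txt
    puts doc.css("span.txt a")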
So we'll give that a go-- ruby craigslist-scraper. And indeed, that gives us all of these tags for the various listings that are on the Craigslist page.

So if we go back, we can turn this into something a little more useful. Maybe we want just the links, because within this tag, I'll also have the hyperlink of the path that this page goes to. So if you look at this code here, what I'll do is, instead of .css, I can use at_css, and this will just get the first element of all of those things. If I were to do that up in the code I just demonstrated, instead of returning all of this, it would just return the first one. So that's how the at_css operator works.

So we want to store the path of the first a tag. And because we still want to pull an attribute out of it, we're still going to use .css. But because this is going to give us back an entire array of tags, we're going to access the first element. This is another way you can access any particular element when an array of elements is returned, because you can treat anything that .css returns as an array, essentially. And then we're going to access the hypertext reference-- href-- attribute of this.

So if you take a look, if you look really closely here, at essentially the URL bar, this is the path that you're going to be scraping. So if we just run this again-- and make sure we've saved it. You can check at home: this actually matches up with this link.

So why might we want to use this? If you want to scrape a page and it has a page of links like Craigslist does, you might want to then go into each of those links and scrape the content of those, which is exactly what we're going to do.

So once you have the path as a variable, I no longer really care about printing it out; I just need to store it as a variable. And then I can access another page the same way I accessed doc in the first place, except with the URL, we're going to use string interpolation, like I was describing in Ruby earlier on, to append the path to the end of the root.
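Continuing from the earlier sketch, that might look like this; the root URL is illustrative, and the variable names follow the ones used here:

    # Index into the array that .css returns, then pull out the href attribute
    path = doc.css("span.txt a")[0]["href"]

    # Append the scraped path onto the site root with string interpolation
    url = "https://boston.craigslist.org"
    item = Nokogiri::HTML(open("#{url}#{path}"))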
So what this is going to do is take the path that I scraped previously and turn that page into a new item-- or whatever you want to call it, first_listing, for example. But I'm going to leave it as item for now, because that is what I'm using here.

So say I wanted to get the description of the first posting on Craigslist. I would go down here and click on Inspect Element again, because this is the description. I'd go down here and see if I can find how I might be able to search for this unique tag. In this case, it has an ID, which leads us to our next way of searching for tags, which is with a hash. So for classes, you can use the dot operator-- .txt specifies a class of txt-- whereas the hash specifies an ID. In this case, the tag is section, and the ID is postingbody.

So this goes and finds the first-- because we're using at_css-- this goes and finds the first element that comes up with the tag section and the ID postingbody. Then you can access the text of the item that's returned with .text, and we can store that in description.

So now that we have a description variable, we might want to do, say, file I/O. File I/O in Ruby is very similar to file I/O in C: we open a file, we might write to it, and then we close the file. Here, we're just naming the file with some arbitrary variable; we could also have just put the name here. We have a variable that we're storing the open file in with File.open, and because we're writing to this file, we open it in w mode. Then we put a string into the file with the .puts method, passing in the variable that we want to write to the file. And then we just close the file.
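As a sketch, building on the item variable from above:

    # The description lives in a section tag with the ID postingbody
    description = item.at_css("section#postingbody").text

    # C-style file I/O: open for writing, write, close
    file = File.open("description.txt", "w")
    file.puts description
    file.close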
So if we go ahead and run this, it should produce a document, description.txt, which will have this description within it. So if I run it-- no. It's produced a text file with, hopefully, the same thing. There might have been a new posting that's come up while I've been talking, and indeed it looks like there has been. So if we go to this classic bike, 1962 to 1966, that seems to match. And there you go.

So that's the most basic functionality of scraping. Instead of just writing to this file, we could add things to arrays. So I declare three arrays: title, price, and description. And we're operating on the doc item now. We can go through and find all of the span.txt's. Remember, this returns an array of all the items that it finds. Then in Ruby, you can just use .each to iterate through every item of the array. And for each item, I'm just going to call it a link, because that's essentially what it is.

So if I call link.css a.hdrlnk on each link, this actually goes into the link and finds within it another HTML element with the corresponding class. If we remember what this was, the span.txt-- let me just go back real quick-- within span.txt we have a lot of other classes. So inside span.txt, we're looking for an a tag with the class hdrlnk. Let me just find that for you guys real quick. You can see here, this is an a tag that's within the span of class txt and that has the class hdrlnk. And that's indeed what we're trying to get.

So we're now storing all of those links inside the title array. And then we're going to print out each of those links. No, sorry-- we're going to print out the price of each of those. So let's run this really quick and see what it does.

So this just basically went through each of the links in turn, accessed the tag in question, and then pulled out the price. It did that because, after storing each link in the title array, in this loop, instead of going to a.hdrlnk, we're looking for a span.price. If I can just really quickly find the price: if you inspect the element, you'll see that it is a span with the class of price. And that's essentially how we're getting the price there.

So that's the really basic case of scraping. That's how you get all the elements on a page that, say, you already know the URL of.
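A sketch of that loop, continuing with the doc variable from before; the exact code on screen may differ a little:

    title = []

    # Each span.txt is one listing's container
    doc.css("span.txt").each do |link|
      title << link.css("a.hdrlnk")      # store the headline link
      puts link.css("span.price").text   # print the listing's price
    end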
342 00:25:16,510 --> 00:25:21,050 >> So if we want to get a little more in depth, 343 00:25:21,050 --> 00:25:23,950 we can scrape pages within pages. 344 00:25:23,950 --> 00:25:28,480 And for this example, I'll be outputting to a CSV file. 345 00:25:28,480 --> 00:25:39,510 So I'm requiring csv up here because Ruby doesn't, inside itself, 346 00:25:39,510 --> 00:25:42,350 have the functionality to just output CSV files. 347 00:25:42,350 --> 00:25:45,030 So that's super simple. 348 00:25:45,030 --> 00:25:48,710 Let me just go to the next. 349 00:25:48,710 --> 00:25:51,640 350 00:25:51,640 --> 00:25:57,170 We covered file I/O. So this is similar to how it is in C. 351 00:25:57,170 --> 00:26:00,870 And before we move on to Kimono, I'll just show you really quick how 352 00:26:00,870 --> 00:26:02,790 to scrape sites within sights. 353 00:26:02,790 --> 00:26:10,040 >> So we already learned how to declare arrays in Ruby. 354 00:26:10,040 --> 00:26:13,280 So I'm just declaring a bunch of arbitrary arrays 355 00:26:13,280 --> 00:26:16,310 that I will be storing data within. 356 00:26:16,310 --> 00:26:20,680 doc is operating the same way as it did in the previous file. 357 00:26:20,680 --> 00:26:23,580 We're going in, finding each of the span.txt's. 358 00:26:23,580 --> 00:26:25,040 We already know that. 359 00:26:25,040 --> 00:26:32,130 That is the container within which each link has all of the data that we want. 360 00:26:32,130 --> 00:26:40,800 >> So here what we're doing is for each link of span class txt, we're going in 361 00:26:40,800 --> 00:26:45,720 and we're finding the a tag, finding the first element of that. 362 00:26:45,720 --> 00:26:49,937 Remember, .css returns an array, so you can't just access it as is. 363 00:26:49,937 --> 00:26:51,520 We're going to find the first element. 364 00:26:51,520 --> 00:26:56,430 Even if it's an array of one item, you have to use this syntax, 365 00:26:56,430 --> 00:26:58,800 and then pull out the href attribute. 366 00:26:58,800 --> 00:27:01,800 >> So we did this earlier. 367 00:27:01,800 --> 00:27:04,440 So this should look familiar. 368 00:27:04,440 --> 00:27:14,330 And so now we have an array called paths of all of our links 369 00:27:14,330 --> 00:27:16,590 that we're going to want to use. 370 00:27:16,590 --> 00:27:21,350 So if we have this array of all of the paths that we want to use, 371 00:27:21,350 --> 00:27:26,840 we can then create an item for each of those pages when we open that page. 372 00:27:26,840 --> 00:27:31,150 So as we also saw on the syntax before, where 373 00:27:31,150 --> 00:27:37,450 doing string interpolation with the path here, so the syntax is just for path. 374 00:27:37,450 --> 00:27:41,450 And I could name this variable any arbitrary name. 375 00:27:41,450 --> 00:27:43,070 >> This is the important one. 376 00:27:43,070 --> 00:27:46,650 This is the array that you'll be accessing each element. 377 00:27:46,650 --> 00:27:52,400 But when you say for path in paths, this means for each element in paths, 378 00:27:52,400 --> 00:27:55,150 call it path, and use that. 379 00:27:55,150 --> 00:27:59,266 This is essentially like when you do a for loop and you use int i. 380 00:27:59,266 --> 00:28:04,000 So you can treat the path as the variable that's incrementing. 381 00:28:04,000 --> 00:28:07,820 >> And then for each of those, go into each of those links. 382 00:28:07,820 --> 00:28:11,710 Because we're storing it in item page, so we're creating a new page every time 383 00:28:11,710 --> 00:28:13,330 we access it. 
So if we have this array of all of the paths that we want to use, we can then create an item for each of those pages when we open that page. As we also saw in the syntax before, we're doing string interpolation with the path here; the syntax is just "for path". And I could name this variable any arbitrary name. This-- paths-- is the important one: this is the array whose elements you'll be accessing. When you say for path in paths, this means: for each element in paths, call it path, and use that. It's essentially like when you do a for loop and you use an int i, so you can treat path as the variable that's stepping through. And then, for each of those, we go into each of those links. Because we're storing it in the page item, we're creating a new page every time we access it.

Then, within that new page, we find span.postingtitletext, span.price, and then section#postingbody. We already covered section#postingbody when we looked at the description. And we can go see, in the Craigslist post, if you're just looking at the title, you can see it up here: span postingtitletext. That's why it's there. And then for the price, you can access it with a span with the class of price.

We also might want to store the URL. So we'll just build it again and store it in an array, because if you're looking on Craigslist and you see something that interests you, you're probably going to want a way to go back to that site. So you just want to store the URL for reference's sake.

This is essentially just another syntax for the for loop: I could do paths.each_with_index instead of for path in paths. In this Ruby syntax, path is what we did up here-- declaring a variable for each item-- and index behaves like the i in C for loops, so you can keep track of what the index is.

And here is just a little convenience for when you're running the scraper. If you're scraping hundreds of pages, to make sure that it's not hanging, it will just output "I'm accessing this page," confirming that it's still continuing. But for our purposes, because there are a hundred items, I'm going to access just three of them so that we don't run out of time here.

But before we get to that, I'm just going to show you really quick: I will be outputting the title, price, description, and URL of each of the links that I've scraped. And then this is just the syntax for the CSV library. You open a CSV-- this is what I'm going to call it-- and open it with write, do. Then csv will be the file that you're putting everything into. This is just a sanity check for me to know that it's running, and this is my sanity check to know that it's completed. So I'm putting title into a row in the CSV, and price, url, and description all into rows in the CSV.
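Here's a sketch of how those pieces might fit together. It builds on the paths and url variables from the earlier sketches, and it assumes each array gets written out as its own row, per the description above:

    require 'csv'

    titles, prices, descriptions, urls = [], [], [], []

    CSV.open("craigslist.csv", "w") do |csv|
      paths.each_with_index do |path, index|
        break if index > 2                  # just three pages, as in the demo
        puts "Accessing page #{index}"      # sanity check that it isn't hanging
        page = Nokogiri::HTML(open("#{url}#{path}"))
        titles       << page.at_css("span.postingtitletext").text
        prices       << page.at_css("span.price").text
        descriptions << page.at_css("section#postingbody").text
        urls         << "#{url}#{path}"
      end
      # One row each for the titles, prices, urls, and descriptions
      csv << titles
      csv << prices
      csv << urls
      csv << descriptions
      puts "Done"                           # sanity check that it completed
    end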
So if we go and run this now-- and let me just make sure that I've saved it-- instead of just outputting to the terminal, we should have a CSV file produced. And here we can see the CSV file that's been produced. This is the output of the scrape that I just ran. As you can see here, it's accessing page 0, 1, 2, 3. These are the titles, prices, and descriptions. And if we look at this CSV file that we've generated, you can see it's output here. This isn't Excel, so it's not formatted in rows and columns, but you can imagine how it might be formatted.

CSV stands for comma-separated values. So you can imagine this might be a row, and each comma would indicate a separate column. Just a word of caution: sometimes you're scraping things with a lot of commas, so if you're outputting to a CSV file, it might not come out the way you might think.
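For what it's worth, Ruby's CSV library guards against this by quoting fields that contain commas, whereas joining fields into a string by hand does not. A quick sketch, with a made-up row:

    require 'csv'

    row = ["Classic bike, barely used", "$100"]

    # Joining by hand lets the embedded comma split the first field in two...
    File.open("naive.csv", "w") { |f| f.puts row.join(",") }

    # ...while the CSV library quotes it: "Classic bike, barely used",$100
    CSV.open("safe.csv", "w") { |csv| csv << row }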
So that's essentially all there is to scraping basic HTML pages with Nokogiri.

Now, the internet, being innovative as it is, has come up with a more automated, GUI-based, albeit less robust way of scraping various websites. For our purposes, I'll be demonstrating a Chrome extension called Kimono. All you have to do is navigate to the page that you want to scrape and click on a field of interest. You calibrate the fields, because it will automatically detect what it thinks you want to be scraping, and then you just create an API.

If we were to demonstrate it on Craigslist, it actually wouldn't work-- this is what I was saying earlier about it not being as robust; it has trouble creating the API. But as a demonstration of what it would do: if you install the Chrome extension, all you do is click on it. It Kimonofies the page, and then you click on the thing you want to scrape. So if I were to click on that, it would highlight what it thinks I want to be scraping off that page. So maybe I call this "listings." This is how many items I have selected, and I can just confirm or deny some of the other suggested listings to get it to add to what will be scraped.

So now we can see there are a hundred items selected. If I want another field to scrape that's related to this-- say I want to scrape the price as well-- then I can do the same. And here's a demonstration of how it's much less robust, because now it's picking up the city instead of just the price that I want, and now it's picked up 200 things. You can go back and delete; you can try again. But no guarantees-- this is how it works sometimes. As you see here, now it says 96 up here. It's picked up most of the links that you want to scrape, but not necessarily all of them.

Another useful tool in Kimono, though, is that you can go to Advanced Features here, go to Advanced, and it will show you the breakdown of the unique way to access the HTML tags that you want to scrape. So for the listings, if you look here, it gives div p span span a. You can actually just use this in your Nokogiri code, where before we had span.txt to access each of the listings. If I just want the text within the listings, I could input div p span span a, with the spaces, and it would achieve the same effect. And for those of you who are interested in using regular expressions, it happens to also give you the regular-expression sort of string to input to find the things you're trying to find.
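As a sketch of that crossover, assuming the doc variable holds the parsed Craigslist page as in the earlier sketches:

    # The descendant selector Kimono reports drops straight into Nokogiri
    listings = doc.css("div p span span a")
    puts listings    # the same elements span.txt a found earlier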
There's another cool feature of Kimono where you can paginate: not only can I scrape the results of this page, I can click on this little button here, Pagination, and specify the button that would take me to the next page. Then it will just know that it can iterate to the next page and scrape all of those links as well-- as long as it's the same format, of course.

So because Kimono doesn't want to work with Craigslist, what we've done is I've Kimonofied the Harvard Crimson. I've pulled out some of the top featured articles-- confirm here; say, all of these. I've compiled this API for you ahead of time. But otherwise, what you would do is just click Done, enter your API details, and set it to either an automated or a manual crawl. So you could update your data every 15 minutes, weekly, daily-- whatever you want. Name your API, create the API. For your benefit, I've created the Crimson front-page API already.

So you just create an account on Kimono, and it will store all your APIs for you-- essentially, all your separate scrapes. So if we look here, these are the opinion links that I've collected, these are the featured links that I've collected, and these are the most-read links that I've collected from this most recent API scrape. So if you can see here, these would be the featured, and these would be the opinions-- which, in this example, I've combined into one collection. But if you just play around with it a little bit, you can split it up and divide it however you want, as long as the formatting is slightly different.

Just to play around with the crawl setup: one of the downsides is that you can only crawl up to 25 pages at a time. That's one of the limiting factors. But here, if you set it to manual crawl, this is how you can tell it to update your data. And here you can see your crawl history of everything that you've crawled. You guys can go back, sign up, and play around with all the different ways that you can modify and use your data.

Kimono can also be set up to scrape links within links. You would do so by first scraping a list of links, and then using that API as a jumping-off point for another API that you create. But that's more complicated than what we're going to get into today.

So that's Kimono. Now we'll talk about the pros and cons of Nokogiri and Kimono. Nokogiri is really fast. It's easy to test-- you can just puts anything to the console-- and it's easy to configure. You can decide exactly what you want to scrape and store, and there are no page limits. I actually used it to scrape something like 1,800 South African school websites for email addresses, for an internship that I did. So that's possible, though best practice would be to split up the script, because if it fails, then you don't get anything.
But if you do a hundred, maybe 200 pages at a time, then you have some chance of at least getting it piecemeal, especially if you have bad internet.
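A sketch of that batching idea, reusing the paths and url variables from the Craigslist sketches; the batch size and file names are arbitrary:

    # Scrape in batches so a mid-run failure only costs the current batch;
    # earlier batches have already been written to disk
    paths.each_slice(100).with_index do |batch, batch_number|
      CSV.open("batch_#{batch_number}.csv", "w") do |csv|
        batch.each do |path|
          page = Nokogiri::HTML(open("#{url}#{path}"))
          csv << [page.at_css("span.postingtitletext").text]
        end
      end
    end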
Unfortunately, Nokogiri can only scrape HTML. So if you have dynamically loaded pages-- and I'll show you an example like Kayak in a second-- Nokogiri unfortunately cannot scrape that.

Kimono, meanwhile, is also easy to use. As you saw, it's essentially point and click, and it can scrape JavaScript. Unfortunately, there's a maximum to how many pages you can scrape, and sometimes it's a little hard to configure-- it gets confused. But it's definitely something to consider if you're not trying to build a super robust, maintainable scrape. If you just want to get everything off of a page quickly, Kimono is a really good tool to use. And as I mentioned before, there's the advanced feature of Kimono that shows you how to access the unique HTML element, which is super useful even if you are working in Nokogiri.

So if we go to the Kayak site, for example, you can see there is-- or maybe you can't see. But if I show you the source for Kayak, this actually is just the page as served at that URL-- the page prior to being modified by whatever JavaScript they have running. And it's going to look different from what you see when inspecting the element.

So if you go through and match up the Inspect Element code to the source code, it's actually going to be different. And this is essentially why Nokogiri can't scrape dynamically loaded sites: Nokogiri is scraping the source served at the URL, whereas Kimono is actually scraping what you're seeing in Inspect Element.

So if I go through and try to Kimonofy Kayak, I can actually go through and select the price. It's a little harder, and in this case it's actually seeing this price as different from these. So whereas you could configure-- or if this weren't dynamically loaded, you could configure Nokogiri to get all of these. Because the formatting is slightly different for this listing compared to the rest of them; and you can see here it's actually gone and selected all the flight prices. Maybe I want to select the time of flight as well, and I can go through and sort of configure that. I don't want that; I just want the next flight's time. And then after a couple of these go through, it gets the picture. So Kimono's pretty smart-- it's just not quite as robust.

There are some other alternatives that you can use, and I'll show you them here. If you are more comfortable in Python instead of Ruby, maybe, there is a library called Beautiful Soup. You can use that; it's very similar to Nokogiri, with a few more features-- you can find an HTML tag and then move up or move sideways.

There's PyQt. This can actually scrape dynamic sites, because it's essentially a WebKit engine that pretends to be a browser without there actually being a browser. So it will wait for all the JavaScript to load first, and then go in and try to scrape the site.

If you want to stick with Ruby, you can go one level up from Nokogiri: you can use Capybara with a Poltergeist wrapper. This can essentially do the same thing as PyQt-- it's a WebKit engine that waits for the JavaScript to load first. And if you fiddle around with it enough, you can even get it to click on things. So if there's a link that isn't a classic href where the path is easily accessible, and it's some JavaScript thing that detects a click, you can actually do that.

The more popular library to simulate a user is in JavaScript, which is PhantomJS. This can obviously scrape dynamic sites, because it's essentially pretending to be Chrome without the user interface.

And then, of course, the most robust but slowest option is Selenium browser automation. Unfortunately, you're not going to be able to do this within your CS50 IDE, because essentially what it does is boot up Chrome, Firefox, or whatever browser you want to use, and it tracks maybe your mouse movements, whatever you type in, and just sort of automates the whole process. So it was developed as a sort of website automation testing tool.
But a lot of people use Selenium to scrape websites that they otherwise have a lot of difficulty scraping with some of these other, faster tools.

So that's all I've got for web scraping. Have fun.

AUDIENCE: Question. Is there a mechanism to cache the website, so you could basically go through it later on?

ROBERT KRABEK: Yeah. So in our example, for both of them, we put the entire website into doc. And so you could actually just take the variable doc and write it to a file. If I wanted to, I could write it out as an HTML file, and then, instead of using OpenURI and a cURL request, I could just open up that HTML and search through that.

AUDIENCE: But can you preserve the sort of online experience while you're offline? For example, when you're flying for several hours, I want to basically archive the whole website. [INAUDIBLE]

ROBERT KRABEK: Yeah, that's exactly-- so literally what this is doing is taking everything that would be at this URL. So if we ran cURL, it's taking all of this HTML and storing it inside the variable doc. So then you can do whatever you want with doc. You can output it to a file.

AUDIENCE: But it's not linked up. It's not dynamic. It's not recursive, right? You see what I mean? I'm trying to basically cache the whole website on my hard drive, so that I could browse it for several hours without internet.

ROBERT KRABEK: Right. So if I had-- so where's my file I/O? So this is the file I/O. Say instead of this, I call the file craigslist.html. I'd open that up, I'd puts doc into it, and I'd close the file. And then, just because the CS50 IDE is in the cloud, I can go here, and I can download the file, and then it would be on my hard drive. So you can do it that way.
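A sketch of that round trip, with an illustrative URL:

    require 'nokogiri'
    require 'open-uri'

    # Fetch once while online...
    doc = Nokogiri::HTML(open("https://boston.craigslist.org/search/bik"))

    # ...save the HTML to disk...
    File.open("craigslist.html", "w") { |f| f.puts doc }

    # ...and later, parse the local copy with no connection at all
    offline = Nokogiri::HTML(File.read("craigslist.html"))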
Or if you're at home, not using the CS50 IDE-- using Sublime or something-- this is even easier, because it's all available locally, not tied to the internet.

AUDIENCE: I see. That's for one particular page. Can you do it recursively, so that you go several layers deep, kind of thing?

ROBERT KRABEK: I can download folders as well, if that's what you're asking.

AUDIENCE: Yeah.

ROBERT KRABEK: Cool.