Saturday, September 29, 2012

The Past Weeks: Building a Multi-Threaded Scraper

I’ve been silent for the past several weeks. No, this isn’t one of those ‘I’m back to posting’ blog posts. This is just an account of where I’ve been and what to expect very soon, assuming I’m allowed to actually post about it, which I really hope I am, since the communities out there helped me so much in solving the problems. By communities I mean other people’s blogs and the like. Which is really awesome when you consider that the information out there was so vast I never had to step into a single #python IRC channel for more help. That, and the Python documentation is brilliant.

So, what have I been so busy with over the last couple of weeks? Well, anything.lk has struck a partnership of sorts with a business to increase the number of items we can offer our fans and loyal customers. Why? Because we believe so much more can be done in this business. (Sorry, but I can’t divulge the details just yet; that’s another reason I’ve been so quiet about the work I’ve been doing.) How many items, you ask? Good question. The answer is anywhere between 2 million and 9 million. The only problem was that, given the size and nature of the partnership, we would have to do the heavy lifting to get these products to us. Since the partner runs an online site, the best way to do that, and to keep our list in sync with theirs, would be to run a scraper across it. Not a crawler. A scraper.

Building this wasn’t easy. The tools I used were Python and BeautifulSoup. My first naïve implementation took about 2-3 seconds per item. At 2.5 seconds per scan, running the scraper across 2 million items would take 57 days. Not good. The next step was to make the scraper multi-threaded. At around 100 worker threads I found the sweet spot between thread count and speed, and left it at that. That brought the scrape time down to 6 days.

From there it was a matter of running a few trial scrapes, and what followed was the discovery of just how memory intensive this was going to get. In under 10 minutes I was eating into 3+ GB of RAM and the usage wasn’t slowing. Several hard crashes later I realized I would have to severely cut down the number of objects held in memory. I limited the scraper to inserting into the database every 100 records, which brought memory usage down to a maximum of about 70 MB. That was, I think, one of the biggest highlights for me personally. After that, I had to further optimize how the connections were handled. The problem was that I was waiting for 100 Queue objects to be cleared before dumping the next 100 in. Why? Because I had made the bad design decision of not putting a cap on the number of threads the scraper could spawn. If I hadn’t done a q.join() every 100 items, I would have been in serious trouble, spawning 2 million threads in a matter of seconds. This was mostly because I was using threads and Queues for the first time, so I wasn’t sure about the right design decisions until much later.
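To make that every-100-records trick concrete, here’s a minimal sketch of the pattern rather than my actual code: worker threads drop parsed records onto a results Queue, and a single writer flushes them to the database in batches of 100 so nothing piles up in memory. The sqlite3 database, the `items` table and its columns are placeholders I’ve made up for illustration.

```python
import queue
import sqlite3

BATCH_SIZE = 100         # flush to the database every 100 records

results = queue.Queue()  # worker threads put (url, title, price) tuples here

def db_writer(total_expected):
    """Drain the results queue in batches so records never pile up in RAM."""
    conn = sqlite3.connect("items.db")  # placeholder database name
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT, price TEXT)"
    )
    batch, written = [], 0
    while written < total_expected:
        batch.append(results.get())     # blocks until a worker produces a record
        written += 1
        if len(batch) >= BATCH_SIZE or written == total_expected:
            conn.executemany("INSERT INTO items VALUES (?, ?, ?)", batch)
            conn.commit()
            batch = []                  # the flushed records can now be garbage collected
    conn.close()
```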

Thus, the next decision was to limit the number of threads and to keep a maxsize on the Queue so that it couldn’t grow to two million objects at once either. This way, as soon as a page was scraped the next page was slotted in, instead of waiting for a batch of 100 to finish before getting its chance. The result? Scraping time went from 6 days down to 3.
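In code, that final shape looks something like the sketch below, again an illustration of the pattern rather than the real scraper: a fixed pool of workers consumes from a bounded Queue, so the producer simply blocks when the queue is full instead of flooding memory, and no thread is ever spawned per item. The 500-slot maxsize and the `scrape_page` helper are assumptions of mine, not values from the actual code.

```python
import queue
import threading
import urllib.request

from bs4 import BeautifulSoup

NUM_WORKERS = 100                      # fixed pool, not one thread per item
url_queue = queue.Queue(maxsize=500)   # bounded, so 2 million URLs never sit in RAM at once
results = queue.Queue()                # drained by a batch writer like the earlier sketch

def scrape_page(url):
    """Hypothetical fetch-and-parse step; the real scraper pulled product details."""
    html = urllib.request.urlopen(url, timeout=30).read()
    soup = BeautifulSoup(html, "html.parser")
    return url, soup.title.string if soup.title else ""

def worker():
    while True:
        url = url_queue.get()
        try:
            results.put(scrape_page(url))
        except Exception:
            pass                       # the real thing logs the failure and moves on
        finally:
            url_queue.task_done()

def run(urls):
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
    for url in urls:
        url_queue.put(url)             # blocks while the queue is full, which is the whole point
    url_queue.join()                   # wait until every queued page has been processed
```

The nice property of this arrangement is that url_queue.put() itself becomes the throttle: feeding in two million URLs never costs more memory than the queue’s maxsize.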

And that is what I’ve been doing these past few weeks. I haven’t even mentioned all the other little fail-safes I had to put in; there were a number of failures from the site being scraped, and from the internet connection at the office, that I had to factor in. The scraper was launched yesterday evening. I have no idea what’s happening with it right now, but if it has broken for some reason due to bad coding, I’ll need to build in another fail-safe that can restart it and pick up where it left off. That’s one thing I haven’t had time to build, given the constraint that the scraped items need to make their way to our site pretty soon. Pray I shall then, I guess.
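For the curious, the fail-safes for site and connection hiccups were mostly variations on a retry-with-backoff wrapper along the lines of the sketch below; the attempt count and delays are made-up numbers rather than the ones actually in use.

```python
import time
import urllib.request

def fetch_with_retries(url, attempts=3, delay=5):
    """Retry a flaky request a few times before giving up on the page."""
    for attempt in range(1, attempts + 1):
        try:
            return urllib.request.urlopen(url, timeout=30).read()
        except OSError:                    # covers URLError, timeouts and dropped connections
            if attempt == attempts:
                raise                      # let the caller record the failed URL
            time.sleep(delay * attempt)    # back off a little longer each time
```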
