Saturday, September 29, 2012

The Past Weeks: Building a Multi-Threaded Scraper

I’ve been silent for the past bunch of weeks. No, this isn’t one of those ‘I’m back to posting’ blog posts. This is just telling where I’ve been and what to expect very soon, if I’m allowed to actually post about it, which I really do hope I am, since the communities out there helped me so much in solving the problems. When I say communities I mean other people’s blogs and such. Which is really awesome when you consider that the information out there was so vast that I never had to visit a single #python IRC channel for more. That, and the Python documentation is brilliant.

So, what is it that I’ve been so busy with over the last couple of weeks? Well, we’ve gotten a partnership of sorts going with a business to help increase the number of items we can supply our fans and loyal customers with. Why? Because we believe that so much more can be done in this business. (Sorry, but I can’t divulge the details just yet. That’s another reason why I’ve been so silent about the work I’ve been doing.) But how many items, you ask? That’s a good question. The answer is anywhere between 2 million and 9 million items. The only problem was, due to the size and the nature of the partnership, we would have to do the heavy lifting to get these products to us. Since they run an online site, the best way to do this, and to keep our list up to date with theirs, would be to run a scraper across it. Not a crawler. A scraper.

Building this wasn’t easy. The tools I used were Python and BeautifulSoup. My first naïve implementation took about 2–3 seconds per item scanned. Running the scraper across 2 million items at 2.5 seconds per scan would take me 57 days. Not good. The next step was to build the scraper in a multi-threaded fashion. At around 100 worker threads I discovered the sweet spot between thread count and speed, and left it at that. The scrape time came down to 6 days.

From that point onwards it was a matter of running a few trial scrapes, and what followed was me discovering just how memory intensive this was going to get. In under 10 minutes I was eating into 3+ GB of RAM and the usage wasn’t slowing. Several hard crashes later, I realized I would have to severely cut down on the number of objects held in memory. I limited the scraper to running an insertion into the database every 100 records, which brought the memory usage down to a maximum of just 70 MB. That was, I think, one of the biggest highlights for me personally. From there, I had to further optimize how the connections were taking place. The problem was that I was waiting for 100 Queue objects to be cleared before dumping the next 100 in. Why? Because I had made the bad design decision of not putting a cap on the number of threads spawned inside the scraper. If I didn’t do a q.join() every 100 items, I would have been in serious trouble, spawning 2 million threads in a matter of seconds. This was mostly because I was using both threads and Queues for the first time, so I wasn’t too sure about the logical decisions I should be making until much later.
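The batching trick can be sketched roughly like this, assuming SQLite as the store (the post doesn’t name the database) and a made-up items table; the numbers are placeholders:

```python
import sqlite3

BATCH_SIZE = 100  # flush to the database every 100 scraped records

def flush(conn, batch):
    # executemany writes the whole batch in one go; clearing the list
    # afterwards lets the accumulated Python objects be garbage-collected
    conn.executemany("INSERT INTO items (name, price) VALUES (?, ?)", batch)
    conn.commit()
    batch.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")

batch = []
for i in range(250):                      # stand-in for 250 scraped records
    batch.append(("item-%d" % i, 1.0 * i))
    if len(batch) >= BATCH_SIZE:
        flush(conn, batch)
flush(conn, batch)                        # write the final partial batch

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

The point is simply that at most BATCH_SIZE records are ever held in memory between database writes, which is what capped the RAM usage.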

Thus, the next decision was to limit the number of threads and keep a maxsize on the Queue, so that it couldn’t grow to two million objects all at once either. This way, as soon as a page was scraped, the next page was slotted in, instead of waiting for a batch of 100 to finish to get its chance. The result? Scraping time dropped from 6 days to 3.
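Roughly, the final pattern looks like this (Python 3 syntax, with a dummy fetch function standing in for the real request-and-parse, and an illustrative maxsize of 200):

```python
import threading
import queue

NUM_THREADS = 100                       # the sweet spot mentioned above
task_queue = queue.Queue(maxsize=200)   # caps how much pending work sits in memory

results = []
results_lock = threading.Lock()

def fake_scrape(item_id):
    # placeholder for the real page fetch + BeautifulSoup parse
    return "item-%d" % item_id

def worker():
    while True:
        item_id = task_queue.get()
        if item_id is None:             # sentinel: no more work
            task_queue.task_done()
            break
        scraped = fake_scrape(item_id)
        with results_lock:
            results.append(scraped)
        task_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

# put() blocks whenever the queue is full, so the producer can never
# race ahead and pile up millions of queued objects
for item_id in range(1000):
    task_queue.put(item_id)
for _ in threads:
    task_queue.put(None)                # one sentinel per worker

for t in threads:
    t.join()
```

A fixed pool of workers pulling from a bounded queue means thread count and memory are both capped up front, instead of being policed with a q.join() every 100 items.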

And that was what I was doing the past few weeks. I haven’t spoken about all the other little fail-safes I had to put in; there were a number of failure modes, from the site being scraped to the internet connection at the office, that I had to factor in. The scraper was launched yesterday evening. I have no idea what’s happening with it right now. But if it has for some reason broken due to bad coding, I’ll need to build in another fail-safe for starting it up and getting it to pick up where it left off. That’s one thing I haven’t had time to build, given the constraint that the scraped items need to make their way onto our site pretty soon. Pray I shall then, I guess.
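A pick-up-where-it-left-off fail-safe could be as simple as checkpointing the last scraped ID to disk. A hypothetical sketch (the file name and the loop body are placeholders, not what actually runs in the scraper):

```python
import os

CHECKPOINT = "scraper.checkpoint"   # hypothetical checkpoint file

def load_checkpoint():
    # return the last successfully scraped item id, or 0 on a fresh start
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip())
    return 0

def save_checkpoint(item_id):
    # write after every committed batch rather than every item,
    # so a crash costs at most one batch of re-scraping
    with open(CHECKPOINT, "w") as f:
        f.write(str(item_id))

start = load_checkpoint()
for item_id in range(start + 1, start + 6):   # stand-in for the scrape loop
    pass                                      # scrape item_id here
save_checkpoint(item_id)
```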

Monday, September 3, 2012

Quick Post: Passing Parameters for an SQL Statement with an IN Clause using SQLiteCommand (C#)

There are probably a variety of ways to solve this little problem, as this Stack Overflow thread will show, but I thought showing how I solved it would still be worth it. All I do, essentially, is build a string: based on how many parameters there are to insert into the SQL statement, I append a ‘?’ with a comma after it, and once I exit the loop I append one more ‘?’. Note that the number of times I run the loop is one less than the actual length of the parameter array. Once I’m done building the string, I iterate through the parameter array and add each parameter in. This code is tested and working.

    using System.Data;
    using System.Data.SQLite;
    using System.Text;

    // conn and sa are assumed in the original snippet; declared here so it
    // stands alone (the connection string is just an example)
    SQLiteConnection conn = new SQLiteConnection("Data Source=mydb.sqlite");
    conn.Open();
    SQLiteDataAdapter sa = new SQLiteDataAdapter();

    StringBuilder selectstring = new StringBuilder();
    selectstring.Append("SELECT ROWID, * FROM tbl1 WHERE ROWID IN (");
    Int64[] arr = { 1, 2, 3, 3, 4 };
    // append "?," once per parameter except the last...
    for (int i = 0; i < arr.Length - 1; i++)
    {
        selectstring.Append("?,");
    }
    // ...then the final "?" closes the list
    selectstring.Append("?)");

    sa.SelectCommand = new SQLiteCommand(selectstring.ToString(), conn);
    foreach (Int64 a in arr)
    {
        sa.SelectCommand.Parameters.Add(new SQLiteParameter(DbType.Int64, a));
    }

    System.Data.DataSet ds = new System.Data.DataSet();
    sa.Fill(ds);
    conn.Close();
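For comparison, the same placeholder-building trick in Python’s sqlite3 module is a one-liner with join (the table and values here are made up):

```python
import sqlite3

ids = [1, 2, 3, 3, 4]
# build "?,?,?,?,?" -- one placeholder per value
placeholders = ",".join("?" for _ in ids)
sql = "SELECT rowid, * FROM tbl1 WHERE rowid IN (%s)" % placeholders

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl1 (name TEXT)")
conn.executemany("INSERT INTO tbl1 (name) VALUES (?)",
                 [("a",), ("b",), ("c",), ("d",), ("e",)])
# the values themselves are passed separately, never spliced into the SQL
rows = conn.execute(sql, ids).fetchall()
```

Duplicate ids (3 appears twice) still match each row only once, so this returns the four rows with rowids 1 through 4.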

Saturday, September 1, 2012

Doubles and Decimals in C#

While I say C#, from what I’ve researched on floating point types so far, this concept could apply to most languages.

If you’ve ever programmed using double type values, I can safely say at this point that you are probably doing it wrong. You may never notice it, but working with doubles is at some point going to give you an error that goes unseen until the day some critical function is supposed to happen based on a value, and then you find the critical function isn’t getting triggered. Murphy’s law. Believe in it.

Where does this spawn from? While building a system for very, very basic data analysis, I needed an even more basic addition of numbers, to ensure that the total of the values in a column equalled 100 before allowing the user to progress to the next step. I was testing the program out and everything seemed fine, and just as I was about to roll the system out, I had a few more values to change in the database. Instead of doing it manually, I thought I’d test the system again and do it through there. (Why is this important? Because it’s incredible how some things can go unnoticed through several hours of complete testing.) While all this time I had been adding whole numbers, in this case I needed to actually add decimal numbers. Somehow, even though a basic on-paper calculation gave me 100, the program was saying the values did not add up to 100. In debug mode, I discovered that at one point, 22.4 + 23.7 was giving me 46.0999999939. What. The. Debug??

Reading through the documentation and several threads on the internet enlightened me that double values aren’t meant to be precise; they are meant for speed. They can be affected by the strangest things, such as DirectX interfering with the floating-point precision settings. It’s scary, but it is what it is. The solution? Use decimal types. Almost always, you are going to want the slightly heavier but exact decimal type. DO NOT use floating-point types where exact values matter.
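You can see the difference for yourself in Python, whose float is the same IEEE 754 double as C#’s double and whose decimal module plays the role of C#’s decimal:

```python
from decimal import Decimal

# 22.4 and 23.7 have no exact binary representation, so the
# double-precision sum drifts away from 46.1
float_sum = 22.4 + 23.7
exact = float_sum == 46.1            # False: the comparison fails

# Decimal stores base-10 digits exactly, so the sum is exactly 46.1
decimal_sum = Decimal("22.4") + Decimal("23.7")
```

Note that the decimals are constructed from strings; Decimal(22.4) would faithfully preserve the already-inexact double value.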

But that begs the question: what are double types useful for? Sure, they are faster, but where would you use them? Essentially, in anything that needs speed while allowing for a very small margin of error. Redrawing sprites based on their screen positions can use doubles to store the vector coordinates. Believe it or not, mining extremely large datasets for a general trend is another likely application. But for those of us who write applications for everyday use, we need precise values, so keep it in mind: decimal types.