I'm working on a web crawler and I want to use HttpWebRequest. It offers asynchronous operations such as BeginGetResponse, but connecting via HttpWebRequest.Create isn't asynchronous. I want to make about 1,000 connections simultaneously, and with this approach (even with an extra thread to make it asynchronous) I can't get even 2 connections going at once: by the time the second one connects, the first has already finished downloading its content, so it's almost as if I were fetching the pages one after another instead of simultaneously.
I was wondering if there's a good way to open about 1,000 connections using HttpWebRequest without creating tons of threads or anything...
Thanks in advance.
Edit: It turned out that it wasn't creating the HttpWebRequest that was slow and blocking, it was BeginGetResponse. Does it block until the request headers are sent? How can I bypass this? Should I make the send asynchronous as well, with BeginGetRequestStream?
Are all these connections going to the same domain?
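Try adding this to your app/web.config (on the .NET Framework, client applications are limited to 2 concurrent connections per host by default):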
<system.net>
  <connectionManagement>
    <add address="*" maxconnection="1000" />
  </connectionManagement>
</system.net>
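The same limit can also be raised in code before the first request is created. This is only a minimal sketch, assuming the classic .NET Framework networking stack where `ServicePointManager` governs `HttpWebRequest` connections; the class and method names are made up for illustration, and 1000 simply mirrors the config value above.

```csharp
using System;
using System.Net;

public static class ConnectionLimitSetup
{
    // Hypothetical helper: raises the per-host connection cap that applies
    // to service points created after this call.
    public static void Apply(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
    }

    public static void Main()
    {
        // Call once at startup, before any request is created.
        Apply(1000);
        Console.WriteLine(ServicePointManager.DefaultConnectionLimit);
    }
}
```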
I don't think you can make multiple connections on the same thread. You need one thread per connection. But you can modify your design to make it more scalable.
You can make one control thread (or maybe several) that does all the heavy lifting. Each control thread spawns several child threads, which go out, get the data, and put it in some kind of array inside the parent class. The control class can then recycle the child threads: once a child thread is finished, it gets another "task". The main idea, IMHO, is to separate the crawling from the processing of the retrieved data. Get it, store it, and process it later.
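One way to picture that design is a fixed set of worker threads pulling URLs from a shared queue and storing results for later processing. The sketch below is only an illustration under assumptions: it uses `BlockingCollection` rather than a hand-rolled array, the names `CrawlPipelineSketch` and `Run` are hypothetical, and the `fetch` delegate stands in for whatever actually downloads a page.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class CrawlPipelineSketch
{
    // Fetches every URL using workerCount recycled threads and returns
    // (url, content) pairs; content is null when fetch threw.
    public static List<KeyValuePair<string, string>> Run(
        IEnumerable<string> urls, Func<string, string> fetch, int workerCount)
    {
        var pending = new BlockingCollection<string>();
        var fetched = new BlockingCollection<KeyValuePair<string, string>>();

        foreach (var url in urls) pending.Add(url);
        pending.CompleteAdding(); // no more tasks will arrive

        var workers = new List<Thread>();
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                // Each worker takes the next task as soon as it finishes one.
                foreach (var url in pending.GetConsumingEnumerable())
                {
                    string content = null;
                    try { content = fetch(url); }
                    catch (Exception) { /* keep null to mark a failed fetch */ }
                    fetched.Add(new KeyValuePair<string, string>(url, content));
                }
            });
            worker.Start();
            workers.Add(worker);
        }

        foreach (var worker in workers) worker.Join();
        fetched.CompleteAdding();

        // Processing happens later, on whatever thread consumes this list.
        return new List<KeyValuePair<string, string>>(fetched.GetConsumingEnumerable());
    }
}
```

Results come back in completion order, not input order, so the caller should key on the URL rather than on position.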
Hope this helps in some way :)
There is no reason that this should be blocking. There are some oddities in how asynchronous web requests work that could force your supposedly asynchronous requests to be synchronous. For starters, if you are actually posting data, you must use BeginGetRequestStream (you cannot mix asynchronous and synchronous calls). See: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.begingetrequeststream.aspx
If I recall correctly, nothing actually happens with WebRequest.Create; it just sets up the object. The request doesn't start until either BeginGetRequestStream or BeginGetResponse is called (depending on whether it's a POST or a GET).
Another big note: in my findings, there is a lot more delay in reading the stream that comes from EndGetResponse than there is in making the request. You should also use the asynchronous version of read on the stream.
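For the POST case, the asynchronous chain might look roughly like the sketch below. It is an illustration, not a definitive implementation: `AsyncPostSketch`, `Post`, and the `onDone` callback are hypothetical names, the form content type is an assumption, and the body write itself is still synchronous, which is only reasonable for small payloads.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class AsyncPostSketch
{
    // Starts a POST and returns; onDone receives the status code or an error.
    public static void Post(string url, string formData,
                            Action<HttpStatusCode?, Exception> onDone)
    {
        // Create only configures the object; the request has not started yet.
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded"; // assumed
        byte[] payload = Encoding.UTF8.GetBytes(formData);
        request.ContentLength = payload.Length;

        // Asynchronous all the way: request stream first, then the response.
        request.BeginGetRequestStream(streamResult =>
        {
            try
            {
                using (Stream body = request.EndGetRequestStream(streamResult))
                {
                    body.Write(payload, 0, payload.Length);
                }
                request.BeginGetResponse(responseResult =>
                {
                    HttpStatusCode? status = null;
                    Exception error = null;
                    try
                    {
                        using (var response =
                            (HttpWebResponse)request.EndGetResponse(responseResult))
                        {
                            status = response.StatusCode;
                        }
                    }
                    catch (Exception ex) { error = ex; }
                    onDone(status, error);
                }, null);
            }
            catch (Exception ex) { onDone(null, ex); }
        }, null);
    }
}
```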
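Putting the GET case together, a fully callback-driven fetch could be sketched as follows. The names `AsyncFetchSketch`, `Fetch`, and `FetchState` are hypothetical, the 8 KB buffer is arbitrary, and decoding as UTF-8 is an assumption. Note also that the documentation for BeginGetResponse says some setup work (DNS resolution, proxy detection, the TCP connect) can still run synchronously before the call returns, which may explain the blocking described in the question's edit.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class AsyncFetchSketch
{
    // Per-request state handed from callback to callback.
    private sealed class FetchState
    {
        public HttpWebRequest Request;
        public WebResponse Response;
        public Stream Body;
        public byte[] Buffer = new byte[8192];
        public MemoryStream Content = new MemoryStream();
        public Action<string, Exception> Done;
    }

    // Starts a GET; onDone later receives the body text or an error.
    public static void Fetch(string url, Action<string, Exception> onDone)
    {
        var state = new FetchState
        {
            Request = (HttpWebRequest)WebRequest.Create(url),
            Done = onDone
        };
        state.Request.BeginGetResponse(OnResponse, state);
    }

    private static void OnResponse(IAsyncResult ar)
    {
        var state = (FetchState)ar.AsyncState;
        try
        {
            state.Response = state.Request.EndGetResponse(ar);
            state.Body = state.Response.GetResponseStream();
            // Read the body asynchronously too, so no thread waits on it.
            state.Body.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
        }
        catch (Exception ex) { Finish(state, null, ex); }
    }

    private static void OnRead(IAsyncResult ar)
    {
        var state = (FetchState)ar.AsyncState;
        string body = null;
        Exception error = null;
        try
        {
            int read = state.Body.EndRead(ar);
            if (read > 0)
            {
                // More data may follow: store this chunk and read again.
                state.Content.Write(state.Buffer, 0, read);
                state.Body.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
                return;
            }
            // Assumes UTF-8; a real crawler would honor the response charset.
            body = Encoding.UTF8.GetString(state.Content.ToArray());
        }
        catch (Exception ex) { error = ex; }
        Finish(state, body, error);
    }

    private static void Finish(FetchState state, string body, Exception error)
    {
        if (state.Response != null) state.Response.Close();
        state.Done(body, error);
    }
}
```

Calling `Fetch` in a loop over the URL list starts each request without dedicating a thread to it; the callbacks run on pool threads as data arrives.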