Lately, I’ve been doing some heavy reading into the future of parallel programming, particularly in the realm of data management. One of the hottest things these days¹ is implementing Google’s MapReduce model. Multiple vendors have taken PostgreSQL and modified it to add MapReduce functionality for scale-out OLAP data warehouses.
In thinking about how this could be applied to my core competency, Microsoft SQL Server, I did some research. I hope that under the hood, Microsoft is thinking about this model for the next release, but it could probably be implemented now (if you have a strong understanding of the engine) via SQL CLR. A few Google searches later, I discovered the Data Miners Blog and their take on it.
Emulating it in raw SQL is very ugly. But in terms of MapReduce’s goal, parallelization, the question is whether this is synchronous or asynchronous processing. SQL Server already has Service Broker, which is designed for more asynchronous distribution of the workload. My understanding of MapReduce is that it’s a synchronous operation: we’d still be waiting for all of the map() functions to complete before reduce() can run to sort or aggregate.
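To make that wait concrete, here is a minimal Python sketch of the idea (not SQL Server code, and not how any vendor implements it). It assumes a toy word-count job; the names `map_words`, `reduce_counts`, and `map_reduce` are made up for illustration. The map() calls run in parallel on a thread pool, but reduce() only starts once every one of them has returned.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """map(): emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(pairs):
    """reduce(): aggregate the emitted pairs into one count per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def map_reduce(chunks, workers=4):
    """Run map() over the chunks in parallel, then reduce() once all are done."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The map() calls are spread across the pool's worker threads, but
        # list() blocks here until every one of them has finished. This is
        # the synchronous barrier between the two phases.
        mapped = list(pool.map(map_words, chunks))

    # Only now, with all map() output in hand, can reduce() aggregate it.
    pairs = [pair for result in mapped for pair in result]
    return reduce_counts(pairs)


# Hypothetical input: three chunks standing in for partitions of a table.
result = map_reduce(["the quick brown fox", "the lazy dog", "The end"])
print(result)
```

The slowest map() call sets the pace for the whole job, which is the opposite of the fire-and-forget queueing that Service Broker is built around.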
The obvious application for MapReduce, then, is as an SSIS data flow! I don’t have the tools available to me to implement it, but it’d be interesting to try.
¹ Monash, Curt; DBMS2; “MapReduce Sound Bites”