Sunday, January 2, 2011

Introduction to distributed programming and Hadoop


        Are you aware that multi-core processors are very difficult to build? Do you know that the leading companies do not depend on multi-core machines alone? Then where do these organizations get their computing power from?

      The simple and straightforward answer to all these questions is distributed computing. In parallel computing, the processing power depends on the number of processors packed into a single machine, and every added processor complicates the design. The more processing power we require, the more complicated the machine becomes. By spreading the work across several independent systems instead, each system stays simple and so does the overall design. The following are the advantages of a distributed system over a parallel system:
  • No single point of failure.
  • Easy to scale up or down.
  • Speed. 

    How is a distributed computing platform implemented?
     
    There are two methods for implementing a distributed computing platform.
    • Using a heterogeneous (master-slave) configuration.
    • Using a homogeneous configuration.
    The heterogeneous configuration is generally preferred because it is much easier to implement and use. In this configuration, every system does what a central server, the master, tells it to do. The master only assigns the work; the slaves do the assigned work and return the results to the master. Note that the work assigned to the slaves need not be computation. For example, the work may be the storage of a large file. The master assigns a part of the file to every slave. Each slave writes its part to disk and then notifies the master of the disk location of that part. The master then holds the disk locations of every part of the large file. In this way the work of storing a large file is distributed.
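
    To make the storage example concrete, here is a small sketch in Python. It is only an illustration and not how any real system does it: the Worker and Master classes, the chunk size and the file name are all made up for this post, the "disk" is just a dictionary in memory, and everything runs in one process instead of over a network.

```python
# Minimal in-process sketch of master/slave file storage.
# All names here (Worker, Master, chunk_size, "big.txt") are hypothetical.


class Worker:
    """A slave node; a dict stands in for its local disk."""

    def __init__(self, name):
        self.name = name
        self.disk = {}  # location id -> chunk bytes

    def store(self, chunk):
        location = len(self.disk)  # next free slot on this worker
        self.disk[location] = chunk
        return location  # the location is reported back to the master

    def read(self, location):
        return self.disk[location]


class Master:
    """Assigns chunks to workers and remembers where each chunk went."""

    def __init__(self, workers, chunk_size):
        self.workers = workers
        self.chunk_size = chunk_size
        self.locations = {}  # file name -> ordered list of (worker, location)

    def store_file(self, name, data):
        placements = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            # Round-robin assignment: chunk k goes to worker k mod N.
            index = (start // self.chunk_size) % len(self.workers)
            worker = self.workers[index]
            placements.append((worker, worker.store(chunk)))
        self.locations[name] = placements

    def read_file(self, name):
        # Reassemble the file from the recorded locations, in order.
        return b"".join(w.read(loc) for w, loc in self.locations[name])


workers = [Worker("w1"), Worker("w2"), Worker("w3")]
master = Master(workers, chunk_size=4)
master.store_file("big.txt", b"hello distributed world")
restored = master.read_file("big.txt")
```

    Notice that the master never keeps the file contents itself. It keeps only the list of locations, which is the point of the example above.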
      
    Apache Hadoop (The Open Source Distributed Programming Platform)

    Hadoop is a distributed computing platform developed under the Apache Software Foundation. It uses the heterogeneous (master-slave) configuration of server systems discussed above. Hadoop was initially built by Doug Cutting in Java, after which it became an Apache project. Many companies, such as Yahoo and Facebook, use Hadoop in their businesses. The design of Hadoop is based on a paper published by Google titled "MapReduce: Simplified Data Processing on Large Clusters".
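
    For readers who have not met the map/reduce idea named in that paper title, here is a rough word-count sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names map_phase, shuffle and reduce_phase are my own, and everything runs on one machine. It only shows the shape of the model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines each group.

```python
# Single-machine illustration of the map/reduce programming model.
# Function names are hypothetical; this is not the Hadoop API.
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = []
    for doc in documents:
        # In a real cluster each document could be mapped on a different slave.
        pairs.extend(map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["the master assigns work", "the slaves do the work"])
```

    Because each call to map_phase looks at only one document and each call to reduce_phase looks at only one key, these calls do not depend on each other, which is what lets a master hand them out to different slaves.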