Components of Hadoop


The previous article gave you an overview of Hadoop and its two components, HDFS and the MapReduce framework. This article gives a brief explanation of the HDFS architecture and how it functions.


The Hadoop Distributed File System (HDFS) is self-healing, high-bandwidth clustered storage. HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing and renaming files and directories.

It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients; they also perform block creation, deletion and replication upon instruction from the NameNode.




The other concept and component of Hadoop is MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

It is a parallel data-processing framework. The MapReduce framework is used to get data out of the various files and DataNodes available in a system. The first step is to push the data onto the different servers, where the files are replicated; in short, to store the data.

In the second step, once the data is stored, the code is pushed onto the Hadoop cluster via the NameNode and distributed to the different DataNodes, which become the compute nodes; the end user then receives the final output.

MapReduce in Hadoop is not a single function; it involves different tasks such as the record reader, map, combiner, partitioner, shuffle and sort, and reduce, which together produce the final output. It splits the input data set into independent chunks that are processed by the map tasks in a completely parallel manner.

The framework sorts the outputs of the maps, which are then passed as input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework also takes care of scheduling tasks, monitoring them and re-executing failed tasks.

MapReduce key-value pairs:

Mappers and reducers always use key-value pairs as input and output. A reducer reduces values per key only. A mapper or reducer may emit 0, 1 or more key-value pairs for every input. Mappers and reducers may emit arbitrary keys or values, not just subsets or transformations of those in the input.


def map(key, value, context)
  value.to_s.split.each do |word|
    word.gsub!(/\W/, '')
    context.write(word, 1) unless word.empty?
  end
end

def reduce(key, values, context)
  sum = 0
  values.each { |value| sum += value.get }
  context.write(key, sum)
end

The map method splits on whitespace, removes all non-word characters and emits each word with a value of 1. The reduce method iterates over the values, adds up the numbers and outputs the input key and the sum.

Input file: Hello World Bye World

Output file:
Bye 1
Hello 1
World 2
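Outside the Hadoop runtime, the same map/shuffle/reduce flow can be sketched in plain Ruby. The method names below and the in-memory `group_by` are illustrative stand-ins for what the framework actually does, not Hadoop APIs:

```ruby
# A plain-Ruby sketch of the word-count flow (no Hadoop runtime involved;
# the in-memory group_by stands in for the framework's shuffle-and-sort).
def map_phase(line)
  line.to_s.split
      .map { |word| word.gsub(/\W/, '') }
      .reject(&:empty?)
      .map { |word| [word, 1] }        # emit one (word, 1) pair per word
end

def reduce_phase(pairs)
  pairs.group_by { |word, _| word }    # "shuffle": collect values per key
       .map { |word, kvs| [word, kvs.sum { |_, count| count }] }
       .sort                           # the framework sorts keys for us
end

pairs = map_phase("Hello World Bye World")
puts reduce_phase(pairs).inspect
# => [["Bye", 1], ["Hello", 1], ["World", 2]]
```

In a real cluster the map and reduce phases run on different machines and the grouping is done by the framework; the data flow, however, is the same.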


This concludes the briefing about the components of Hadoop, their architecture and functioning, and the steps involved in the different processes happening in both systems.

Like a coin with two faces, Hadoop also has its pros and cons, which are discussed below. Complete knowledge of any concept is possible only once you know the merits and demerits of that concept.

To acquire complete knowledge of Hadoop, keep following the upcoming posts of this blog.

 The Two Faces of Hadoop


Pros:

  • Hadoop is a platform that provides both distributed storage and computational capabilities.
  • Hadoop is extremely scalable. In fact, Hadoop was first conceived to fix a scalability issue in Nutch: start at 1 TB on 3 nodes and grow to petabytes on thousands of nodes.
  • One of the major components of Hadoop is HDFS (the storage component), which is optimized for high throughput.
  • HDFS uses large block sizes, which helps it work best when manipulating large files (gigabytes, petabytes and beyond).
  • Scalability and availability are the distinguishing features of HDFS, achieved through data replication and fault tolerance.
  • HDFS can replicate files a specified number of times (the default is 3 replicas) and is tolerant of software and hardware failure; moreover, it can automatically re-replicate data blocks from nodes that have failed.
  • Hadoop uses the MapReduce framework, a batch-based, distributed computing framework that allows parallelized work over a large amount of data.
  • MapReduce lets developers focus on addressing business needs only, rather than getting involved in distributed-system complications.
  • To achieve parallel and faster execution of a job, MapReduce decomposes it into map and reduce tasks and schedules them for remote execution on the slave or data nodes of the Hadoop cluster.
  • Hadoop also has the capability to work with MapReduce jobs created in other languages; this is called streaming.
  • It is well suited to analyzing big data.
  • Amazon's S3 is the ultimate source of truth here and HDFS is ephemeral. You don't have to worry about reliability; Amazon S3 takes care of that for you. This also means you don't need a high replication factor in HDFS.
  • You can take advantage of cool archiving features like Glacier.
  • You also pay for compute only when you need it. It is well known that most Hadoop installations struggle to hit even 40% utilization [3], [4]. If your utilization is low, spinning up clusters on demand may be a winner for you.
  • Another key point is that your workloads may have spikes (say, at the end of the week or month) or may be growing every month. You can launch larger clusters when you need to and stick with smaller ones otherwise.
  • You don't have to provision for peak workload all the time. Similarly, you don't need to plan your hardware 2-3 years upfront, as is common practice with in-house clusters. You can pay as you go and grow as you please, which considerably reduces the risk involved in Big Data projects.
  • Your administration costs can be significantly lower, reducing your TCO.
  • No up-front equipment costs: you can spin up as many nodes as you like, for as long as you need them, then shut them down. It is getting easier to run Hadoop on them.
  • Economics: cost per TB at a fraction of traditional options.
  • Flexibility: store any data, run any analysis.


Cons:

  • As you know, Hadoop uses HDFS and MapReduce, and both of their master processes are single points of failure, although there is active work going on for high-availability versions.
  • Until the Hadoop 2.x release, HDFS and MapReduce use single-master models, which can result in single points of failure.
  • Security is also a major concern: Hadoop does offer a security model, but it is disabled by default because of its high complexity.
  • Hadoop does not offer storage- or network-level encryption, which is a very big concern for government-sector application data.
  • HDFS is inefficient for handling small files and lacks transparent compression; HDFS is not designed to work well with random reads over small files, due to its optimization for sustained throughput.
  • MapReduce is a batch-based architecture, which means it does not lend itself to use cases that need real-time data access.
  • MapReduce is a shared-nothing architecture, so tasks that require global synchronization or sharing of mutable data are not a good fit, which can pose challenges for some algorithms.
  • S3 is not very fast, and vanilla Apache Hadoop's S3 performance is not great. We, at Qubole, have done some work on Hadoop's performance with the S3 filesystem.
  • S3, of course, comes with its own storage cost.
  • If you want to keep the machines (or data) around for a long time, it is not as economical a solution as a physical cluster.

Here ends the briefing on Big Data and Hadoop, their various systems, and their pros and cons. We hope you now have an overview of the concepts of Big Data and Hadoop.

Source: RailsCarma



Scaling Applications with Multiple Database Connections

Business requirements keep changing day by day, and we keep optimizing or scaling our applications based on usage and on new feature additions or removals. Overall, agile development adds challenges every now and then.

Applications that depend on databases can be scaled by separating the database layer and scaling it independently. The OPS team takes care of such infrastructure changes based on the application deployment architecture.

As programmers, we can configure our application to work with multiple databases. In this document we explain how to achieve this in a Rails application.

There are 3 different ways to connect an extra database to an application:

  1. Set-up database.yml
  2. Direct connection
  3. Writing in module

1. Set-up database.yml:

As we know, database.yml has 3 database connections by default, for development, test and production. We can connect another database in all three environments by adding the configuration shown below.

other_development:
  adapter: adapter_name (mysql2, postgresql, oracle, mssql, etc.)
  database: database_name_development
  username: user_name
  password: ******

other_test:
  adapter: adapter_name (mysql2, postgresql, oracle, mssql, etc.)
  database: database_name_test
  username: user_name
  password: ******

other_production:
  adapter: adapter_name (mysql2, postgresql, oracle, mssql, etc.)
  database: database_name_production
  username: user_name
  password: ******

After setting up database.yml, we can connect in 2 ways, based on the cases below:

  • Known database structure
  • Unknown database structure

Known database structure:

If we are aware of the database structure, we can create a model for each table and establish the connection in the model:


class OtherTable < ActiveRecord::Base
  self.abstract_class = true
  establish_connection "other_#{Rails.env}"
end


This can also be inherited by another model:

class Astronaut < OtherTable
  has_many :missions
  has_many :shuttles, through: :missions
end


Unknown database structure:

When we don't know the database structure, we can write just one model and establish the connection to it, performing CRUD operations based on dynamic parameters.


class ExternalDatabaseConnection < ActiveRecord::Base
  self.abstract_class = true # this class doesn't have a table
  establish_connection "other_#{Rails.env}"
end
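Since the table is not known until runtime, the usual trick is to build a subclass on the fly and point it at whatever table name we are handed. That shape can be sketched in plain Ruby; note that `StubBase` and `model_for` below are hypothetical stand-ins for the ActiveRecord machinery, so the `table_name` accessor here is ours, not ActiveRecord's:

```ruby
# Sketch of the "dynamic model" idea with a stub base class.
class StubBase
  class << self
    attr_accessor :table_name        # ActiveRecord::Base exposes this too
  end
end

# Build a model class at runtime for whatever table name we are handed.
def model_for(table)
  klass = Class.new(StubBase)
  klass.table_name = table
  klass
end

customers = model_for("legacy_customers")
puts customers.table_name            # => "legacy_customers"
```

With the real `ExternalDatabaseConnection` as the base class, the generated subclass would read from the external database's table of that name.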



2. Direct connection:

If the 2nd database is not very important and is used in only one or two places, we can directly call ActiveRecord::Base.establish_connection with credentials and interact with that database:



ActiveRecord::Base.establish_connection(
  :adapter  => "adapter_name",
  :username => "user_name",
  :password => "*********",
  :database => "database_name"
)

3. Writing in module:

We can also establish the connection from a module that is included in a model, as shown below.


module SecondDatabaseMixin
  extend ActiveSupport::Concern
  included { establish_connection "other_#{Rails.env}" }
end
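The `included` hook is what runs `establish_connection` at include time. That mechanism can be imitated in plain Ruby without Rails; in this sketch `Module.included` plays the role of ActiveSupport::Concern's `included` block, and `establish_connection` is a stub that just records the spec it was given (`FakeModel` is hypothetical):

```ruby
# Plain-Ruby imitation of the include-time hook used by the mixin above.
module SecondDatabaseMixin
  def self.included(base)
    base.establish_connection("other_development")
  end
end

class FakeModel
  def self.establish_connection(spec)
    @connection_spec = spec          # a real model would open a connection here
  end

  class << self
    attr_reader :connection_spec
  end

  include SecondDatabaseMixin        # triggers the hook at include time
end

puts FakeModel.connection_spec       # => "other_development"
```

Any model that includes the mixin gets its connection wired up automatically, which keeps the connection logic in one place.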


External database connection:

The database to be connected can exist on any server. If it is not on the same server, we can set the host to the IP address of the server where it resides.


adapter: adapter_name (mysql2, postgresql, oracle, mssql, etc.)
host: external_db_server_ip
username: user_name
password: *******
database: db_name

Note: There are a few gems available for this, such as magic_multi_connections and db-charmer.

Pros and cons:


Pros:

  • If the application has multiple clients, each can have a different database for their customers.
  • It helps in taking backups per client.
  • The other database may also be used by another application, even one with a different adapter.
  • When users report that access is slow, it is easy to know which DB is causing the trouble.


Cons:

  • It is overkill if the application is simple, with few users.
  • Extra code maintenance is needed whenever any of the database structures change.

Source: RailsCarma

A Simple Way To Increase The Performance Of Your Rails App

Developing an application and optimizing its performance should go hand in hand, but mostly, due to deadlines and the pressure to complete a project's features, the scope of optimization shrinks and optimization is kept for the end (which is not a good practice). Other factors, such as lack of experience and substandard coding, also lead to decreased performance.

There are many ways to increase performance while coding your application. It is a vast topic and subjective to the application you are developing; here we discuss small changes that can be made to improve performance.

The main areas to concentrate on to improve your app's performance while developing:

  1. Database optimization and Query optimization
  2. JavaScript, CSS optimization

Let's concentrate on database optimization. Here are a few quick tips (again, these are basic tweaks which we feel should be applied, and they are subjective to applications and developers):

  • Maintain proper indexing for the required tables in the DB (don't overdo indexing; it may also lower performance. You can decide this on a case-by-case basis).
  • Maintaining proper relations and associations between models is also a major factor affecting app performance, and proper utilization of associations will increase it.
  • Fetch only when required and only what is required, and reuse data fetched from the DB as much as possible.
  • Optimize queries by limiting the data fetched, and fetch data in batches when dealing with large amounts of data.
  • Database caching can be used to reduce response time and the number of queries. We can achieve it by implementing memcached via the dalli gem.
  • Do not write queries in a loop; it is the biggest "don't" while coding. If it is already done, find a way to rewrite that part of the code and avoid calling a query in a loop.

These are a few points that can be looked into for optimizing a Rails application. We also recommend the Bullet gem in development, which is very useful for reducing N+1 queries in an application.
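The cost of queries in a loop is easy to see with a toy counter. No ActiveRecord is involved here; the `fetch_*` helpers are hypothetical stand-ins that only tally how many queries each style would issue:

```ruby
# Toy counter showing why queries in a loop (the N+1 pattern) hurt.
$queries = 0

ARTICLES = [
  { id: 1, comments: %w[good thanks] },
  { id: 2, comments: %w[nice] },
].freeze

def fetch_articles
  $queries += 1                  # SELECT * FROM articles
  ARTICLES
end

def fetch_comments(article)
  $queries += 1                  # one SELECT per article: the "+1" in N+1
  article[:comments]
end

def fetch_all_comments(articles)
  $queries += 1                  # single SELECT ... WHERE article_id IN (...)
  articles.flat_map { |a| a[:comments] }
end

# Query-in-a-loop (N+1) style: 1 + N queries
$queries = 0
fetch_articles.each { |article| fetch_comments(article) }
puts "N+1 style: #{$queries} queries"   # 3 queries for 2 articles

# Batched style: always 2 queries, however many articles there are
$queries = 0
fetch_all_comments(fetch_articles)
puts "batched:   #{$queries} queries"   # 2 queries
```

In a real Rails app the batched style corresponds to eager loading, e.g. `Article.includes(:comments)`, which is exactly the pattern the Bullet gem nudges you toward.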

The project is available on GitHub:
If you have any other pointers to add to this, feel free to comment. We shall take them while writing our next set of articles.

Source: RailsCarma