Volume 10 Issue 3 - May 2018

  • 1. Efficient deduplication using hadoop

    Authors : Manjunath R. Hudagi, Sachin A. Urabinahatti

    Pages : 236-238

    DOI : http://dx.doi.org/10.21172/1.103.40

    Keywords : Cloud storage, Deduplication, Hadoop, Hadoop distributed file system, Hadoop database.

    Abstract :

    In cloud computing, we found that when a user uploads the same file twice under the same file name, the system does not allow the file to be saved again. It also does not allow saving a file with the same file name but different content. Hadoop is a high-performance distributed data storage and processing system, but it does not provide an effective data deduplication solution. Suppose a popular video or movie file is uploaded to HDFS by one million users: through Hadoop replication it is stored as three million files, which wastes disk space. With the proposed system, only the space of a single file is occupied, which achieves complete removal of duplicate files. Before uploading data to HDFS, we calculate the hash value of the file and store that hash value in a database for later use. When the same user or another user later tries to upload a file with the same name and the same content, an SHA algorithm is used to calculate its hash value, which is then checked against HBase (HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop). If the hash value matches a stored hash value, the system returns the message “File already exists”.
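
    The hash-then-check flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: SHA-256 is assumed because the abstract only says "an SHA algorithm", the HBase table and HDFS are replaced by in-memory dictionaries, and the names `DedupUploader`, `sha256_of_stream`, and `upload` are hypothetical.

```python
import hashlib
import io

ALREADY_EXISTS = "File already exists"
STORED = "Stored"


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()  # assumption: the abstract does not name the SHA variant
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class DedupUploader:
    """Toy uploader: dictionaries stand in for HBase (index) and HDFS (storage)."""

    def __init__(self):
        self.hash_index = {}  # stand-in for the HBase table: hash -> first file name
        self.storage = {}     # stand-in for HDFS: hash -> file content

    def upload(self, file_name, data):
        # Step 1: compute the hash value before anything is written.
        file_hash = sha256_of_stream(io.BytesIO(data))
        # Step 2: look the hash up in the index; a match means a duplicate.
        if file_hash in self.hash_index:
            return ALREADY_EXISTS
        # Step 3: no match, so store the content and record its hash for later use.
        self.storage[file_hash] = data
        self.hash_index[file_hash] = file_name
        return STORED


uploader = DedupUploader()
print(uploader.upload("movie.mp4", b"same bytes"))  # first upload is stored
print(uploader.upload("movie.mp4", b"same bytes"))  # duplicate content is rejected
```

    Keying the stand-in storage by hash rather than by file name is a design choice of this sketch only: it keeps identical content from occupying space twice regardless of the name under which it is uploaded.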

    Citing this Journal Article :

    Manjunath R. Hudagi, Sachin A. Urabinahatti, "Efficient deduplication using hadoop", Volume 10 Issue 3 - May 2018, 236-238