Apache Sentry – The Open Standard for Unified Authorization in Hadoop

Version 1

    By: Steve Ross (Cloudera) and Ritu Kama (Intel)


    One of the biggest selling points of Hadoop is its flexibility: it not only allows enterprises to store more data for longer, but also opens that data up to more users across the company. More and more enterprises are designing their Hadoop clusters to serve as corporate data platforms where data from multiple sources and lines of business is aggregated and analyzed. While this has obvious benefits for many organizations, including faster time to value and new and deeper insights, it also raises several security concerns. With more data of every type being stored, it becomes harder to identify what’s potentially sensitive. And with this data open to more departments and users, it’s even more crucial to ensure that users can access only the data their security privileges allow, without inhibiting insights.


    Project Rhino, a blueprint for adding enterprise-grade security to Hadoop, was started by Intel to address the common security concerns in Hadoop, including the need for unified authorization. The idea behind unified authorization is to provide an easy and scalable way for administrators to define role-based access controls once and have them apply across every access path in Hadoop, including access to data through a variety of third-party applications, rather than having to define permissions repeatedly for each access path and each user.
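
    To make the "define once, enforce everywhere" idea concrete, here is a minimal conceptual sketch in Python. It is not Sentry's API: the class names, the `check_access` helper, the group, role, and table names, and the list of access paths are all hypothetical, chosen only to illustrate a single policy store being consulted by several access paths.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    """A hypothetical privilege: an action allowed on a named resource."""
    resource: str  # e.g. "sales.orders" (illustrative name)
    action: str    # e.g. "select"


@dataclass
class PolicyStore:
    """A single shared store mapping groups -> roles -> privileges."""
    role_privileges: dict[str, set[Privilege]] = field(default_factory=dict)
    group_roles: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, role: str, privilege: Privilege) -> None:
        self.role_privileges.setdefault(role, set()).add(privilege)

    def add_role_to_group(self, group: str, role: str) -> None:
        self.group_roles.setdefault(group, set()).add(role)

    def is_allowed(self, groups: list[str], resource: str, action: str) -> bool:
        wanted = Privilege(resource, action)
        return any(
            wanted in self.role_privileges.get(role, set())
            for group in groups
            for role in self.group_roles.get(group, set())
        )


def check_access(store: PolicyStore, access_path: str, groups: list[str],
                 resource: str, action: str) -> bool:
    # access_path only labels the caller; every path asks the same store,
    # so the decision does not depend on how the data is reached.
    return store.is_allowed(groups, resource, action)


# The administrator defines the policy once.
store = PolicyStore()
store.grant("analyst", Privilege("sales.orders", "select"))
store.add_role_to_group("finance", "analyst")

# Several (illustrative) access paths consult the same policy.
ACCESS_PATHS = ["hive", "impala", "mapreduce"]
decisions = {
    path: check_access(store, path, ["finance"], "sales.orders", "select")
    for path in ACCESS_PATHS
}
```

    Because every path consults the same store, a change to a role takes effect on all of them at once, which is the property the paragraph above describes.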


    Apache Sentry, an open source tool that’s an integrated part of Cloudera’s security platform, directly addresses this problem and has become the standard for unified authorization in Hadoop. Since its creation in 2012, Sentry has been donated to the Project Rhino initiative and has seen broad contributions from Cloudera, IBM, Intel, and Oracle, which helps ensure the sustained quality and testing that enterprises need, especially for security authorization. Multiple Hadoop vendors, including Cloudera, IBM, MapR, and Oracle, also ship and support Sentry as part of their platforms, which means you can keep your access controls without lock-in, even if you switch vendors.


    With all of these contributors, Sentry has reached some major milestones, including:

    • Shifting to a database-backed architecture (instead of the previous configuration files)
    • Delegated GRANT and REVOKE
    • Metadata protection (Hive Metastore) in addition to data protection
    • HDFS integration, which makes Sentry permissions effective when data is accessed through a variety of components, including MapReduce, Apache Pig, Apache Spark, and others


    Over the past year, production installations of Sentry have accelerated, especially in regulated industries where strong and tightly managed access controls are a must. Companies in financial services, healthcare, insurance, pharmaceuticals, and telecommunications, including Western Union and SFR, have deployed Sentry for its unified authorization.


    Sentry’s popularity across Hadoop has also led many third parties to integrate with it, ensuring its compatibility not only within the Hadoop ecosystem but with your existing tools as well.


    As mentioned, Sentry’s goal is to provide unified authorization for all Hadoop services. Cloudera and Intel recently made significant progress toward this goal by adding HDFS integration to Sentry (available with CDH 5.3), in addition to the existing Hive, Impala, and Search integrations. This extends Sentry permissions to any access path that uses HDFS and lays critical groundwork for unified authorization across all Hadoop services.


    For more information on adding HDFS integration to Sentry, read “New in CDH 5.3: Apache Sentry Integration with HDFS” or register for the upcoming webinar, “Project Rhino: Enhancing Data Protection for Hadoop.”


    Future work will continue to add more fine-grained permissions for all access paths, while also expanding Sentry’s use for permissions enforcement in third-party applications that use Hadoop.