Securing a Hadoop Cluster

I drafted this while responding to a customer query and thought it would be useful for others as well.

This draft summarizes the security features available in BigInsights, with specific emphasis on Big SQL. The general security concepts apply to other Hadoop distributions as well.

Authentication
When users log in to the InfoSphere® BigInsights™ Console, if their credentials are validated, they gain access to InfoSphere BigInsights Console functionality based on role membership. Three types of authentication are supported: flat file, LDAP and PAM.
1. Flat file
Password encryption with hex or Base64 encodings is supported. The digest encoding is valid only if the digest algorithm is specified. If the digest algorithm is specified and the digest encoding is not specified, the default hex encoding is assumed. If a digest algorithm is not specified, it is assumed that passwords are stored in plain text.
2. LDAP
If you choose LDAP authentication during installation, you configure the InfoSphere BigInsights Console to communicate with an LDAP credential store for users, groups, and user-to-group mappings.
3. PAM
If you choose PAM authentication during installation, when a user accesses the InfoSphere BigInsights Console, the username and password are passed to the InfoSphere BigInsights PAM module.
You can also use PAM with LDAP so that when a user logs in, InfoSphere BigInsights communicates with your PAM module, and then your PAM module communicates with your LDAP server to authenticate users.
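To make the flat-file digest options described above more concrete, here is a minimal Python sketch that digests the same password and encodes the result as hex (the default when no encoding is specified) or Base64. The choice of SHA-256 and the function name `digest_password` are assumptions for illustration only; the actual flat-file layout and the algorithms BigInsights accepts are not shown in this post.

```python
import base64
import hashlib


def digest_password(password: str, algorithm: str = "sha256", encoding: str = "hex") -> str:
    """Return the digest of `password`, encoded as hex (default) or Base64.

    `algorithm` is any name accepted by hashlib; SHA-256 is an assumed example.
    """
    raw = hashlib.new(algorithm, password.encode("utf-8")).digest()
    if encoding == "hex":
        return raw.hex()
    if encoding == "base64":
        return base64.b64encode(raw).decode("ascii")
    raise ValueError(f"unsupported digest encoding: {encoding}")


if __name__ == "__main__":
    print(digest_password("secret"))                     # hex encoding
    print(digest_password("secret", encoding="base64"))  # Base64 encoding
```

Both strings represent the same digest bytes; only the text encoding differs, which is why the encoding setting is meaningful only when a digest algorithm is specified.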
Authentication for Big SQL
Big SQL uses the BigInsights Console to authenticate users.

Authorization
BigInsights supports four predefined roles. For each role, there is a set of actions that can be performed in the cluster.
• BigInsightsSystemAdministrator: Performs all system administration tasks. For example, a user in this role can monitor the cluster’s health and add, remove, start, and stop nodes.
• BigInsightsDataAdministrator: Performs all data administration tasks. For example, these users create directories, run Hadoop file system commands, and upload, delete, download, and view files.
• BigInsightsApplicationAdministrator: Performs all application administration tasks, for example publishing and unpublishing (deleting) an application, deploying an application to and removing it from the cluster, configuring the application icons, applying application descriptions, changing the runtime libraries and categories of an application, and assigning permissions of an application to a group.
• BigInsightsUser: Runs applications that the user is given permission to run and views the results, data, and cluster health. This is typically the most commonly granted role, given to cluster users who perform non-administrative tasks.
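The role model above can be pictured as a simple lookup from role to permitted actions. The sketch below is illustrative only: the role names come from the list above, but the action labels (such as `add_node` or `upload_file`) and the `is_allowed` helper are hypothetical names made up for this example, not BigInsights identifiers.

```python
# Hypothetical action labels that summarize the role descriptions above.
ROLE_ACTIONS = {
    "BigInsightsSystemAdministrator": {
        "monitor_cluster", "add_node", "remove_node", "start_node", "stop_node",
    },
    "BigInsightsDataAdministrator": {
        "create_directory", "run_fs_command", "upload_file",
        "delete_file", "download_file", "view_file",
    },
    "BigInsightsApplicationAdministrator": {
        "publish_app", "unpublish_app", "deploy_app",
        "remove_app", "assign_app_permission",
    },
    "BigInsightsUser": {
        "run_permitted_app", "view_results", "view_cluster_health",
    },
}


def is_allowed(user_roles, action):
    """Return True if at least one of the user's roles grants the action."""
    return any(action in ROLE_ACTIONS.get(role, set()) for role in user_roles)


if __name__ == "__main__":
    print(is_allowed(["BigInsightsUser"], "view_results"))
    print(is_allowed(["BigInsightsUser"], "add_node"))
```

A user who holds several roles gets the union of their actions, which is why the check uses `any` over the role list.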

Authorization for the Big SQL server
The authorization modes use Hadoop Distributed File System (HDFS) permissions to control data access.
• none
Using the none authorization mode means that all operations inside the Big SQL server are done using a super-user or admin user authority, regardless of which user is connected to the server. This mode is typically used when security is not much of a concern, such as with test systems.
• dataSource
In the dataSource mode, if the currently connected user is an admin, then the Big SQL server uses the user ID of the Big SQL server process owner, typically biadmin, to read from and write to HDFS and to spawn MapReduce jobs.
If the currently connected user is not an admin, then the above operations are performed using the currently connected user’s user ID. For example, a user with user ID user1 can query a table if user1 has read permission on the schema directory, table directory, and table files via the owner, group, or other read permission.
Users can share or restrict the data access by changing the HDFS ownership and HDFS permissions of the table schema directory. All users who connect to the Big SQL server must have a valid user login to the Big SQL server. All users also need an entry in the InfoSphere® BigInsights™ Console for one of the authentication mechanisms: no-security, flat-file security, LDAP, or PAM. This entry provides the Big SQL server with the user ID and password for the relevant users.
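The read check described for dataSource mode can be sketched in a few lines of Python. This is a simplified model, not Big SQL code: the `HdfsEntry` class, the `can_read` and `can_query` helpers, and all paths, owners, and groups are hypothetical. It checks only the read bit mentioned above; real HDFS also requires execute permission on directories in order to traverse them.

```python
from dataclasses import dataclass


@dataclass
class HdfsEntry:
    """A directory or file with HDFS-style owner, group, and mode bits."""
    path: str
    owner: str
    group: str
    mode: int  # permission bits, for example 0o750


def can_read(entry: HdfsEntry, user: str, groups: set) -> bool:
    """Check the read bit of the single permission class that applies to the user."""
    if user == entry.owner:
        return bool(entry.mode & 0o400)  # owner read bit
    if entry.group in groups:
        return bool(entry.mode & 0o040)  # group read bit
    return bool(entry.mode & 0o004)      # other read bit


def can_query(entries, user: str, groups: set) -> bool:
    """A query needs read access to the schema directory, table directory, and table files."""
    return all(can_read(entry, user, groups) for entry in entries)


if __name__ == "__main__":
    # Hypothetical layout: schema directory, table directory, one data file.
    table = [
        HdfsEntry("/warehouse/sales.db", "biadmin", "analysts", 0o750),
        HdfsEntry("/warehouse/sales.db/orders", "user2", "analysts", 0o750),
        HdfsEntry("/warehouse/sales.db/orders/part-00000", "user2", "analysts", 0o640),
    ]
    print(can_query(table, "user1", {"analysts"}))  # user1 is matched by the group class
    print(can_query(table, "user3", {"interns"}))   # user3 is matched by the other class
```

Note that the classes are not cumulative: an owner whose read bit is cleared is denied even if the group or other bits would allow the read.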

If two users share a table in such a way that user-1 creates the table and user-2 loads the data into the table, then user-2 requires the write permission on the parent directory for the table. After the table data is loaded, the table is owned by user-2. If user-1 must own the table, then user-2 can change the permission on the table to make user-1 the owner of the table.

Similarly, a user’s access to particular tables can be restricted. The snapshot below shows an access-denied response.

Accounting
InfoSphere BigInsights stores security audit information as audit events in its own audit log files for general security tracking. As part of core Hadoop, HDFS and MapReduce provide basic audit reports. Additionally, you can configure InfoSphere BigInsights to send audit log events to InfoSphere Guardium for security analysis and reporting. After InfoSphere BigInsights events exist in the InfoSphere Guardium repository, other InfoSphere Guardium features such as workflow to email and track report signoff, alerting, and reporting are available.
The security audit information that InfoSphere BigInsights generates depends on your environment. The following list is representative of the type of information that InfoSphere BigInsights generates:
• Hadoop Remote Procedure Calls (RPC) authentication and authorization successes and failures
• Hadoop Distributed File System (HDFS) file and permission-related commands such as cat, tail, chmod, chown, and expunge
• Hadoop MapReduce information about jobs, operations, targets, and permissions
• Oozie information about jobs
• HBase operation authorization for data access and administrative operations, such as global privilege authorization, table and column family privilege authorization, grant permission, and revoke permission

Data Protection
Data protection ensures privacy and confidentiality of information.
• The InfoSphere BigInsights installer provides the option to configure HTTPS to secure connections to the BigInsights web console. If HTTPS is selected, the Secure Sockets Layer and Transport Layer Security (SSL/TLS) protocol protects all communication between the browser and the web server.
• We can enable SSL encryption for the data exchange between clients and the IBM Big SQL server.
• The HttpFS server (Hadoop HDFS over HTTP) supports Basic Authentication.
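As a generic illustration of what SSL/TLS protection means on the client side, the Python sketch below builds a TLS context that verifies the server certificate and host name. It is not the BigInsights or Big SQL driver configuration, which is not covered in this post; the helper name `make_client_context` and the TLS 1.2 minimum are assumptions chosen for the example.

```python
import ssl


def make_client_context(ca_file=None):
    """Build a client-side TLS context that verifies the server it connects to.

    ca_file: optional path to a CA bundle; when None, the system defaults are used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # Assumed policy for this sketch: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.verify_mode, ctx.check_hostname)
```

A socket wrapped with such a context encrypts traffic and rejects servers whose certificate cannot be validated, which is the property HTTPS gives the web console.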

Summary
The InfoSphere® BigInsights™ security architecture includes authentication, roles and authorization levels, HTTPS support for the InfoSphere BigInsights console, and reverse proxy.
