Querying HBase using Spark

Query HBase using Spark.

For more information and examples, see HBase Example Using HBase Spark Connector.
  1. Grant the Spark user ("spark") permission to perform CRUD operations in HBase (R=read, W=write, X=execute, C=create, A=admin), using the "hbase" user:
    sudo -u hbase bash
    kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase
    hbase shell
    grant 'spark', 'RWXCA'
    exit
  2. Sign in to Ranger.
  3. Select the HBase service.
  4. Add or update a policy that grants "create, read, write, execute" access to the Spark user.
  5. Sign in with the Spark user account and create a table named 'person' with column families 'p' and 'c' in HBase:
    sudo su spark
    (kinit as the spark user if required)
    hbase shell
    hbase(main):001:0> create 'person', 'p', 'c'
  6. Start spark-shell:
    spark-shell --jars /usr/lib/hbase/hbase-spark.jar,/usr/lib/hbase/hbase-spark-protocol-shaded.jar,/usr/lib/hbase/* \
      --files /etc/hbase/conf/hbase-site.xml \
      --conf spark.driver.extraClassPath=/etc/hbase/conf
  7. Insert and read data using spark-shell:
    • Inserting data:
      val sql = spark.sqlContext

      import java.sql.Date

      case class Person(name: String,
                        email: String,
                        birthDate: Date,
                        height: Float)

      val personDS = Seq(
        Person("alice", "alice@alice.com", Date.valueOf("2000-01-01"), 4.5f),
        Person("bob", "bob@bob.com", Date.valueOf("2001-10-17"), 5.1f)
      ).toDS

      personDS.write.format("org.apache.hadoop.hbase.spark")
        .option("hbase.columns.mapping",
          "name STRING :key, email STRING c:email, " +
          "birthDate DATE p:birthDate, height FLOAT p:height")
        .option("hbase.table", "person")
        .option("hbase.spark.use.hbasecontext", false)
        .save()

      Results:

      shell> scan 'person'
      ROW       COLUMN+CELL
       alice    column=c:email, timestamp=1568723598292, value=alice@alice.com
       alice    column=p:birthDate, timestamp=1568723598292, value=\x00\x00\x00\xDCl\x87 \x00
       alice    column=p:height, timestamp=1568723598292, value=@\x90\x00\x00
       bob      column=c:email, timestamp=1568723598521, value=bob@bob.com
       bob      column=p:birthDate, timestamp=1568723598521, value=\x00\x00\x00\xE9\x99u\x95\x80
       bob      column=p:height, timestamp=1568723598521, value=@\xA333
      2 row(s)
    • Reading data back:
      val sql = spark.sqlContext
      
      val df = sql.read.format("org.apache.hadoop.hbase.spark")
        .option("hbase.columns.mapping",
          "name STRING :key, email STRING c:email, " +
          "birthDate DATE p:birthDate, height FLOAT p:height")
        .option("hbase.table", "person")
        .option("hbase.spark.use.hbasecontext", false)
        .load()
      df.createOrReplaceTempView("personView")
      
      val results = sql.sql("SELECT * FROM personView WHERE name = 'alice'")
      results.show()
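The binary values in the scan output of step 7 are HBase's big-endian byte encodings of the typed columns, written by the connector. As a minimal, self-contained sketch (plain Scala with java.nio, no HBase or Spark libraries required), here is how alice's p:height and p:birthDate cells decode back to the original values:

```scala
import java.nio.ByteBuffer

// alice's p:height cell prints as @\x90\x00\x00 in the HBase shell,
// i.e. the bytes 0x40 0x90 0x00 0x00: a big-endian IEEE-754 float.
val heightBytes = Array(0x40, 0x90, 0x00, 0x00).map(_.toByte)
val height = ByteBuffer.wrap(heightBytes).getFloat  // 4.5f

// alice's p:birthDate cell is an 8-byte big-endian long holding epoch
// milliseconds; 946713600000 maps back to 2000-01-01 in the timezone
// of the machine that wrote the row.
val dateBytes =
  Array(0x00, 0x00, 0x00, 0xDC, 0x6C, 0x87, 0x20, 0x00).map(_.toByte)
val millis = ByteBuffer.wrap(dateBytes).getLong  // 946713600000L
```

This is why `scan 'person'` shows escaped binary rather than the literals from the dataset: only the connector's column mapping knows the types needed to decode them.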