Writes and Streaming
HWC supports batch writes and Structured Streaming writes into Hive ACID tables.
Batch writes (DataFrame writer)
For batch writes, HWC stages data as files in HDFS and then issues a LOAD DATA statement into the target table. The configured read mode does not affect write behavior.
spark.range(0, 100)
  .selectExpr("id", "concat('v', id) as v")
  .write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("database", "hwc_it")
  .option("table", "t_acid")
  .mode("overwrite")
  .save()
Notes:
- Create the target table first (Spark 3 does not auto-create Hive tables on write).
- Use a fully qualified HDFS path for the staging directory so staged files remain accessible on secure (Kerberized) clusters during the write.
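Since the target table must exist before the write, it can be created up front through the HWC session API. A minimal sketch; the database and table names match the batch example above, while the column types are assumptions chosen to fit the id/v projection:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the active SparkSession.
val hive = HiveWarehouseSession.session(spark).build()

// Create the target ACID table before writing to it.
// Column types are assumptions matching the batch example's projection.
hive.executeUpdate(
  """CREATE TABLE IF NOT EXISTS hwc_it.t_acid (
    |  id BIGINT,
    |  v STRING
    |) STORED AS ORC
    |TBLPROPERTIES ('transactional' = 'true')""".stripMargin)
```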
Streaming writes (Structured Streaming)
Use the streaming sink to write to ACID tables:
import org.apache.spark.sql.streaming.Trigger
val q = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 5)
  .load()
  .selectExpr("cast(timestamp as string) as ts", "value")
  .writeStream
  .format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
  .outputMode("append")
  .option("database", "hwc_it")
  .option("table", "t_stream")
  .option("metastoreUri", "thrift://hms-host:9083")
  .option("checkpointLocation", "hdfs://nameservice/tmp/hwc_ckpt")
  .trigger(Trigger.Once())
  .start()
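Because the query uses Trigger.Once(), it processes whatever data is available and then stops on its own; blocking the driver until that single batch completes is a common pattern:

```scala
// Wait for the single-batch streaming query to finish before the job exits.
q.awaitTermination()
```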
Streaming notes:
- The target table must be transactional (ACID).
- metastoreUri is required for streaming.
- Use cleanUpStreamingMeta to remove metadata for a stopped query.
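The transactional target table for the stream can likewise be created ahead of time via the HWC session. A sketch under the assumption that the table schema mirrors the ts/value projection of the rate source in the example above:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// The streaming sink requires a full-ACID (transactional) table.
// Column names and types are assumptions mirroring the example's projection.
hive.executeUpdate(
  """CREATE TABLE IF NOT EXISTS hwc_it.t_stream (
    |  ts STRING,
    |  value BIGINT
    |) STORED AS ORC
    |TBLPROPERTIES ('transactional' = 'true')""".stripMargin)
```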