深入剖析 HIVE 的锁和事务机制

1 从调度系统对 HIVE 作业的调度策略聊起

大家知道，依托 HIVE 构建数据仓库时，我们一般会通过调度系统精心编排 HIVE SQL 作业，以避免多个作业并发写同一张表或同一个表分区；同时 AD HOC 的数据查询操作，在时间上一般也会尽量避开常规的 ETL 数据写操作，以避免对同一张表或同一个表分区的并发读写操作。

调度系统之所以对 HIVE SQL 作业采取上述调度策略，究其原因，是因为多作业并发读写同一个表或同一个表分区时，底层会因为 HIVE 的锁机制导致大量锁竞争和等待，此时作业运行效率极低，所以业务应用只能通过调度系统规避掉对同一张表或表分区的并发读写。

2. 一个因为调度系统配置不当导致的 HIVE 作业运行效率低下的线上例子

笔者最近排查的某证券客户的线上应用运行效率低下的原因，就是上述 HIVE SQL 作业调度策略不当，在此跟大家分享下。

问题现象：在进行 HIVE 作业的性能分析时，从 HIVE ON SPARK 作业的 SPARK WEB UI 上，我们观察到，某个 SQL 作业对应的一系列 SPARK 任务执行完毕后，大部分 EXECUTORs 都被动态释放归还给了 YARN，但是后续过了七八分钟又动态地申请了新的一批 EXECUTORS，以执行后续的 SQL 作业，这是不正常的，因为虽然我们开启了 SPARK ON YARN 的动态资源分配机制，但是前后两个 SQL 作业是在同一个 SESSION 会话中顺序提交的，理想状况下，SPARK EXECUTORS 不应该动态释放又重新申请，除非前后两个 SQL 的提交/执行间隔时长，大于参数 spark.dynamicAllocation.executorIdleTimeout 的值（该参数默认值 60s）
问题原因：经过排查应用日志和YARN 日志，发现前后两个 SQL 确实是顺序提交的,其提交间隔并没有大于60s，同时 YARN 集群中对应的资源队列，当时也是有充足的资源的；进一步排查 HIVESERVER2 日志，发现 HS2 虽然在上一个 SQL 执行完毕后，及时收到了客户端提交的新的 SQL 并成功完成了解析和编译，但真正生成 SPARK 任务并提交执行，却是在七八分钟之后！正是由于这七八分钟的时间间隔，触发了SPARK ON YARN 动态资源分配机制下的 executor 的动态释放和申请；
问题根本原因：仔细分析 HS2 中详细的日志，结合业务方编写的 HIVE SQL 业务代码，发现上述 SQL 的解析和对应 SPARK 作业的执行之间的七八分钟的时间间隔，正是由于 HIVE 表分区的锁竞争引起的: 业务方编写了多个 HIVE SQL 作业，这些 SQL 作业通过 insert into 插入到了同一张表的同一个分区，在配置调度系统时，这些 HIVE SQL 作业被配置在了不同的流程中并被并发地调度执行，所以此时这些被并发调度的多个 HIVE SQL 作业在底层会因为对同一个表分区的的锁竞争引起等待，在上述案列中达到了七八分钟，触发了 spark executor 的动态释放和申请；
问题解决方案：更改业务 SQL 和调度逻辑，不再多个任务并发写同一个表分区，比如先分别写数据到多个表例如 report_data1-5，最后再把数据 union 插入到 report_data 表.

3 HIVE 的锁和事务机制

3.1 HIVE 传统的锁机制

HIVE 传统的锁机制，概括如下：

非分区表：For a non-partitioned table:
- When the table is being read, a S lock is acquired; 读取操作会获取S锁
- whereas an X lock is acquired for all other operations (insert into the table, alter table of any kind etc.) 其它操作会获取X锁
分区表：For a partitioned table, the idea is as follows:
- A 'S' lock on table and relevant partition is acquired when a read is being performed; 读取操作或获取表和相关分区的S锁；
- For all other operations, an 'X' lock is taken on the partition. However, if the change is only applicable to the newer partitions, a 'S' lock is acquired on the table, whereas if the change is applicable to all partitions, a 'X' lock is acquired on the table. Thus, older partitions can be read and written into, while the newer partitions are being converted to RCFile. Whenever a partition is being locked in any mode, all its parents are locked in 'S' mode. 其它操作，会获取分区的X锁；（可能会获取表的S或X锁）如下官方链接，对此有详细的描述： https://cwiki.apache.org/confluence/display/Hive/Locking

3.2 HIVE 的 Transaction/Lock Manager

在传统的锁机制的基础上，为同时支持 HIVE ACID 事务表和非事务表，HIVE 推出了 transaction manager 和 lock manager:

A new logical entity called "transaction manager" was added which incorporated previous notion of "database/table/partition lock manager",the transaction manager is now additionally responsible for managing of transactions locks;
the default DummyTxnManager emulates behavior of old Hive versions: has no transactions and uses hive.lock.manager property to create lock manager for tables, partitions and databases; 配置 transaction manager 为 DummyTxnManager，以提供对传统锁机制的支持，此时不支持事务；
A newly added DbTxnManager manages all locks/transactions in Hive metastore with DbLockManager (transactions and locks are durable in the face of server failure). This means that previous behavior of locking in ZooKeeper is not present anymore when transactions are enabled. 配置 transaction manager 为 DbTxnManager，以提供对事务的支持，此时底层通过 DbLockManager管理事务和锁；
To avoid clients dying and leaving transaction or locks dangling, a heartbeat is sent from lock holders and transaction initiators to the metastore on a regular basis. If a heartbeat is not received in the configured amount of time, the lock or transaction will be aborted.
Note that the lock manager used by DbTxnManager will acquire locks on all tables, even those without "transactional=true" property. 当使用 DbTxnManager时，读写操作对事务表和非事务表都会加锁；
By default, Insert operation into a non-transactional table will acquire an exclusive lock and thus block other inserts and reads. While technically correct, this is a departure from how Hive traditionally worked (i.e. w/o a lock manger). 默认情况下，对非事务表的写操作，会加 X 锁；
For backwards compatibility, hive.txn.strict.locking.mode (see table below) is provided which will make this lock manager acquire shared locks on insert operations on non-transactional tables. This restores previous semantics while still providing the benefit of a lock manager such as preventing table drop while it is being read. 配置hive.txn.strict.locking.mode为false后,对非事务表的写操作，会加 S 锁；
Note that for transactional tables, insert always acquires share locks since these tables implement MVCC architecture at the storage layer and are able to provide strong read consistency (Snapshot Isolation) even in presence of concurrent modification operations. 对事务表的 INSERT 操作，会获取 S 锁，因为底层存储层的 MVCC 机制确保了 SNAPSHOT ISOLATION，确保了并发度写情况下的强一致性；

4 事务和锁的相关配置参数

事务和锁主要的相关配置参数有：

hive.txn.manager：
- 配置为 org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager 以提供类似 Hive-0.13 版本之前的传统的锁机制，此时不提供对 transactions 事务的支持；
- 配置为 org.apache.hadoop.hive.ql.lockmgr.DbTxnManager 以提供对事务表的支持；
hive.lock.manager
- 配置为 org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager，基于 zk构建锁，提供类似 Hive-0.13 版本之前的传统的锁机制，此时不提供对 transactions 事务的支持；
- 配置为 org.apache.hadoop.hive.ql.lockmgr.DbLockManager，基于 metastore db构建锁，提供了对事务表的支持；
hive.support.concurrency: true/false, 需要配置为 TURE 才能支持并发读写和事务；
hive.txn.strict.locking.mode：true/false:
- In strict mode non-ACID resources use standard R/W lock semantics, e.g. INSERT will acquire exclusive lock. 在 STRICT 模式下，非事务表获取的也是传统的 R/W读写锁，此时 INSERT 获取的时 X排他锁；
- In non-strict mode, for non-ACID resources, INSERT will only acquire shared lock, which allows two concurrent writes to the same partition but still lets lock manager prevent DROP TABLE etc. when the table is being written to (as of Hive 2.2.0). 在非严格模式下，INSERT 非事务表时会获取S锁，即允许多个作业并发写同一个非事务表；
需要说明的是，当 hive.txn.manager 配置为 org.apache.hadoop.hive.ql.lockmgr.DbTxnManager 时，不管 hive.lock.manager 配置为 org.apache.hadoop.hive.ql.lockmgr.DbLockManager还是 org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager，其底层实际使用的都是 org.apache.hadoop.hive.ql.lockmgr.DbLockManager，见源码：

事务和锁，spark 动态资源分配，以及HS2并发性，其它相关参数还有：

hive.txn.timeout
hive.txn.heartbeat.threadpool.size
spark.dynamicAllocation.enabled spark.dynamicAllocation.{initial/min/max}Executors
spark.dynamicAllocation.executorAllocationRatio
spark.dynamicAllocation.schedulerBacklogTimeout
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.cachedExecutorIdleTimeout
hive.metastore.server.min.threads
hive.server2.thrift.min.worker.threads
hive.server2.thrift.max.worker.threads

参考链接：

HIVE-15774：Ensure DbLockManager backward compatibility for non-ACID resources
https://cwiki.apache.org/confluence/display/Hive/Locking
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-LockManager
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowLocks