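Estimator is a high-level API in TensorFlow that sits above layers. With it, once the inputs are wired up correctly, training comes down to setting parameters and calling an interface. Code written this way is well structured and easy to read, but there is a downside: because you don't write the train loop yourself, you run into many unfamiliar input parameters, so you need to know the API fairly well or you can fiddle for hours and still be in a fog.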

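Introduction

Let's first look at the overall TensorFlow framework diagram:

[Figure: TensorFlow framework diagram]

When writing deep learning models we usually work with the mid-level API, and the rough workflow is: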

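Read data from files
Wrap the data in a dataset
Build the model
Build the optimizer and evaluation metrics
Write the train loop for training and the eval pass for evaluation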

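With estimator the flow is different. Roughly:

Read and wrap the data
Build the model
Call the training API

The biggest difference is in training, which estimator integrates directly. Estimator also ships with ready-made models that can be used as is, without defining your own, but those are mostly for outsiders. Who doing algorithm work would use the built-in models? If you haven't defined your own model you'd be embarrassed to say you know DL.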

API

We'll work backwards to get familiar with estimator step by step.

(1) train_and_evaluate

First comes training. The API call looks like this:

    tf.estimator.train_and_evaluate(
        estimator=estimator,
        train_spec=train_spec,
        eval_spec=eval_spec
    )

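train_and_evaluate: estimator's built-in training interface. It takes only three inputs:

estimator: the custom model

train_spec: the wrapped training set

eval_spec: the wrapped evaluation set

As the name says, this API trains and evaluates.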

(2) Building the estimator

  def construct_estimator(...):
    model_fn=model_build_fn(...)
    run_config=tf.estimator.RunConfig(...)
    estimator=tf.estimator.Estimator(model_fn=model_fn,
                                     params={...},
                                     config=run_config)
    return estimator

  estimator=construct_estimator(...)

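construct_estimator here is a custom function for assembling the model. Its parameters are up to you: whatever the model needs can be passed in.

Inside, it goes through three steps:

model_build_fn: a custom function, usually named like this, that builds your model; this is where the custom model is defined

run_config: the configuration for the model, such as where the model is stored and how ckpts are saved. RunConfig is part of the estimator API and has many parameters: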

  def __init__(self,
               model_dir=None,
               tf_random_seed=None,
               save_summary_steps=100,
               save_checkpoints_steps=_USE_DEFAULT,
               save_checkpoints_secs=_USE_DEFAULT,
               session_config=None,
               keep_checkpoint_max=5,
               keep_checkpoint_every_n_hours=10000,
               log_step_count_steps=100,
               train_distribute=None,
               device_fn=None,
               protocol=None,
               eval_distribute=None,
               experimental_distribute=None,
               experimental_max_worker_delay_secs=None):

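model_dir: output path of the model, i.e. the save_path
save_checkpoints_steps: save a ckpt every this many training steps
save_checkpoints_secs: save a ckpt every this many seconds of training
keep_checkpoint_max: maximum number of ckpts to keep

These four are the ones commonly used. Note that RunConfig accepts at most one of save_checkpoints_steps and save_checkpoints_secs.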

tf.estimator.Estimator: creates an estimator. Its initializer is:

def __init__(self, model_fn, model_dir=None, config=None, params=None,
               warm_start_from=None):

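model_fn: the custom model function
model_dir: where the model is saved. This is usually already set in config, so it need not be set again; Estimator picks it up from config
config: the RunConfig holding your settings
params: custom parameters; they are passed on to the model automatically later, and things like batch_size can be defined here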

(3) model_build_fn

The custom model file; it returns a model_fn function:

def model_build_fn(...):
    def model_fn(features, labels, mode, params):
        """
        :param features: passed in from input_fn
        :param labels: passed in from input_fn
        :param mode: set by the estimator
        :param params: the params argument given when constructing the Estimator
        :return: a tf.estimator.EstimatorSpec
        """
    return model_fn

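The input parameters of model_build_fn are up to you, while those of model_fn are fixed: features and labels are filled in automatically from what input_fn returns.

Inside model_fn you first build the model; after that there are three cases, corresponding to train, eval and predict: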

if mode == tf.estimator.ModeKeys.TRAIN:
  pass
elif mode == tf.estimator.ModeKeys.EVAL:
  pass
else:
  pass

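mode is determined automatically; presumably train_and_evaluate passes the mode argument to model_fn internally depending on whether it is training or evaluating.

tf.estimator.ModeKeys.TRAIN: the operations needed for training are wrapped in a tf.estimator.EstimatorSpec. Training generally needs loss and train_op (the optimizer op that applies gradients to update the parameters), so in the training phase model_fn returns: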

  output_spec = tf.estimator.EstimatorSpec(
      mode=mode,
      loss=loss,
      train_op=train_op
  )

First, here is the constructor of tf.estimator.EstimatorSpec:

  def __new__(cls,
              mode,
              predictions=None,
              loss=None,
              train_op=None,
              eval_metric_ops=None,
              export_outputs=None,
              training_chief_hooks=None,
              training_hooks=None,
              scaffold=None,
              evaluation_hooks=None,
              prediction_hooks=None):

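As you can see, predictions is for PREDICT and eval_metric_ops is for EVAL. training_hooks, evaluation_hooks and prediction_hooks take hooks that run alongside the corresponding phase (logging is a typical use) and are optional. scaffold is for setting up initialization and the saver; it is generally used in training and can also be left out. export_outputs describes the model's outputs when exporting to a pb file, and is needed if the model is going to be deployed.

Since training needs to compute the loss and update the parameters, the training phase generally passes loss and train_op; mode must always be passed.

tf.estimator.ModeKeys.EVAL: the eval phase does the same wrapping, but passes loss and eval_metric_ops, where eval_metric_ops is a dict: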

acc = tf.metrics.accuracy(labels=label_ids, predictions=pred)
p = tf.metrics.precision(labels=label_ids, predictions=pred)
r = tf.metrics.recall(labels=label_ids, predictions=pred)
metrics = {
    'acc': acc,
    'p': p,
    'r': r,
}

tf.estimator.ModeKeys.PREDICT: the prediction phase is similar to eval, but only predictions needs to be passed, again as a dict:

prediction = {
    'pred_label': pred,
    'proba': tf.nn.softmax(logits),
    'logits': logits,
    'truth_label': label_ids,
    'input_ids': input_ids,
    'input_mask': input_mask,
}
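
To see the three branches side by side, here is a minimal sketch in plain Python. It does not use TensorFlow: Spec is a made-up stand-in for tf.estimator.EstimatorSpec, the mode constants and the threshold "model" are invented for illustration, and train_op is just a placeholder string. It only shows which fields each mode fills in.

```python
from collections import namedtuple

# Hypothetical stand-in for tf.estimator.EstimatorSpec; only mode is required.
Spec = namedtuple(
    "Spec",
    ["mode", "predictions", "loss", "train_op", "eval_metric_ops"],
    defaults=(None, None, None, None),
)

# Illustrative mode constants (stand-ins for tf.estimator.ModeKeys).
TRAIN, EVAL, PREDICT = "train", "eval", "predict"


def model_fn(features, labels, mode, params):
    # Toy "model": predict 1 when the feature exceeds a threshold from params.
    preds = [int(x > params["threshold"]) for x in features]

    if mode == PREDICT:
        # Prediction has no labels, so only predictions is returned.
        return Spec(mode=mode, predictions={"pred_label": preds})

    # TRAIN and EVAL both need a loss computed against the labels.
    loss = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
    if mode == TRAIN:
        return Spec(mode=mode, loss=loss, train_op="train_op placeholder")

    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return Spec(mode=mode, loss=loss, eval_metric_ops={"acc": acc})


params = {"threshold": 0.5}
print(model_fn([0.2, 0.9], [0, 1], EVAL, params).eval_metric_ops)
print(model_fn([0.2, 0.9], None, PREDICT, params).predictions)
```

In the real model_fn the same three-way split applies, with tensors and ops in place of the Python lists.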

That completes building the estimator. Next comes building input_fn.

(4) input_fn

Back to the train_and_evaluate call from before:

tf.estimator.train_and_evaluate(
    estimator=estimator,
    train_spec=train_spec,
    eval_spec=eval_spec
)

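We have now defined the first argument. The other two are, as the names suggest, the training and evaluation sets.

Both are wrapped with APIs that come with estimator: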

    train_spec=tf.estimator.TrainSpec(
        input_fn=train_input_fn,
        max_steps=train_steps
    )

    eval_spec=tf.estimator.EvalSpec(
        input_fn=dev_input_fn,
        steps=None,
        start_delay_secs=60,
        throttle_secs=60,
        exporters=best_ckpt_exporter,
    )

The constructor of TrainSpec is:

def __new__(cls, input_fn, max_steps=None, hooks=None):

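So the key input is input_fn, i.e. our wrapped training set. max_steps is the maximum number of training steps; it is generally worth setting explicitly to avoid errors, and train_and_evaluate stops on its own once that step count is reached. hooks takes optional hook objects (for logging, for example) and can be ignored here.

So TrainSpec is given the wrapped train_input_fn and train_steps.

EvalSpec is slightly more involved; its constructor is: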

  def __new__(cls,
              input_fn,
              steps=100,
              name=None,
              hooks=None,
              exporters=None,
              start_delay_secs=120,
              throttle_secs=600):

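There are many parameters, but we only need to care about these:

input_fn: the wrapped dev_input_fn
steps: the number of evaluation steps (batches). It depends on the situation: you can set it to cover the dev set (dev size divided by batch size), or set it to None, in which case evaluation stops by itself when input_fn runs out of data
exporters: the instance that exports the model; here it exports the best-ckpt. It calls the Exporter and compares the current result with the previously exported best-ckpt; if the current metric is better, the checkpoint is saved. When the number of saved best-ckpts exceeds the configured maximum, the oldest one is deleted. That maximum is presumably the same as keep_checkpoint_max above (the stock tf.estimator.BestExporter has its own exports_to_keep argument for this)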
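
The keep-the-best behaviour described for exporters can be sketched in plain Python. This only illustrates the bookkeeping and is not the library's implementation: update_best_exports, lower_loss, the (step, eval_result) tuples and the exports_to_keep value are all hypothetical names chosen here.

```python
def update_best_exports(kept, step, eval_result, compare_fn, exports_to_keep):
    """Record (step, eval_result) only if it beats the most recent kept export.

    kept is a list of (step, eval_result) tuples, oldest first (hypothetical helper).
    """
    best = kept[-1][1] if kept else None
    if best is None or compare_fn(best, eval_result):
        kept.append((step, eval_result))
        # Drop the oldest exports once the limit is exceeded.
        while len(kept) > exports_to_keep:
            kept.pop(0)
    return kept


def lower_loss(best, current):
    # The current result is better when its loss is strictly smaller.
    return current["loss"] < best["loss"]


kept = []
for step, loss in enumerate([0.9, 0.7, 0.8, 0.5, 0.4]):
    update_best_exports(kept, step, {"loss": loss}, lower_loss, exports_to_keep=2)

print([step for step, _ in kept])
```

Step 2 (loss 0.8) is skipped because it does not beat 0.7, and only the two most recent improvements survive.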

With all of that covered, what remains is how to produce input_fn.

train_input_fn is usually obtained like this:

train_input_fn=file_base_input_fn_builder(input_file=train_file,
                                          max_seq_len=max_seq_len,
                                          is_training=True)

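Note that there are two ways to build an input_fn:

Read the data, convert it, and keep it in memory for later use; this suits small datasets
When the dataset is large, save the converted data in tf_record format first and read it back when needed, which reduces the memory burden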
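
The first approach (everything in memory) usually takes the shape of a builder that returns a closure. The sketch below shows only that shape in plain Python. A real input_fn for estimator would return a tf.data.Dataset rather than yield lists, and the function name in_memory_input_fn_builder and the batch_size key in params are assumptions made here, not taken from the original code.

```python
import random


def in_memory_input_fn_builder(features, is_training, seed=0):
    """Return an input_fn closure over already-converted features (hypothetical sketch)."""

    def input_fn(params):
        batch_size = params["batch_size"]  # assumed key, supplied via the params dict
        order = list(range(len(features)))
        if is_training:
            # Shuffle only for training; evaluation keeps the original order.
            random.Random(seed).shuffle(order)
        for start in range(0, len(order), batch_size):
            yield [features[i] for i in order[start:start + batch_size]]

    return input_fn


dev_input_fn = in_memory_input_fn_builder(["a", "b", "c", "d", "e"], is_training=False)
print(list(dev_input_fn({"batch_size": 2})))
```

The builder captures the data and the is_training flag once, so the returned input_fn can be handed to TrainSpec or EvalSpec without further arguments.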

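In the train_input_fn call above the input is train_file: the training set is generally large, so it is recommended to convert it to a tf_record file first and then wrap it by reading from that file. The code that converts to a tf_record file is: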

dp=DataProcessor(max_seq_len=Flags.max_seq_len)
dp.file_base_convert_examples_to_features(examples=train_data,
                                              label2id_map=label2id_map,
                                              tokenizer=tokenizer,
                                              output_file=train_file)

dp is a custom data processor that defines most of the data-processing operations.

file_base_convert_examples_to_features reads the examples, converts them to tf_record and saves them. Its implementation:

    # Assumes `import collections` and `import tensorflow as tf` at module level.
    def file_base_convert_examples_to_features(self, examples, label2id_map, tokenizer, output_file):
        # Convert the data to features and save them in a tf_record file
        tf.logging.info("*** starting convert data to tf_record ***")

        def create_int_feature(values):
            # Wrap a sequence of ints as an int64 feature
            return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

        writer = tf.io.TFRecordWriter(output_file)
        for (idx, example) in enumerate(examples):

            feature = self.convert_single_example(idx=idx, example=example, tokenizer=tokenizer, label_map=label2id_map)
            if feature is None:
                continue

            features = collections.OrderedDict()
            features['input_ids'] = create_int_feature(feature.input_ids)
            features['label_ids'] = create_int_feature([feature.label_ids])
            features['length'] = create_int_feature([feature.length])

            tf_example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(tf_example.SerializeToString())
        # Close the writer so that all records are flushed to disk
        writer.close()

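This is easy to follow: iterate over the data converting one example at a time, then serialize and save the resulting features. Note the convert_single_example function, which converts a single example. Its implementation: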

    def convert_single_example(self, idx, example, tokenizer, label_map):
        # Convert a single example to a feature
        tokens = [token for token in example.text]
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        length = len(tokens)
        if length > self.max_seq_len:
            length = self.max_seq_len

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # Pad with 0 up to max_seq_len, or truncate down to it
        if length < self.max_seq_len:
            input_ids += [0] * (self.max_seq_len - length)
        else:
            input_ids = input_ids[:self.max_seq_len]

        assert len(input_ids) == self.max_seq_len

        label = example.label  # label(s) of one sample
        label_ids = None
        # Our task is a classification task
        if label:
            if isinstance(label, list):
                label_ids = [label_map[i] for i in label]
            else:
                label_ids = label_map[label]

        feature = DataFeature(
                              input_ids=input_ids,
                              label_ids=label_ids,
                              length=length
                              )
        return feature

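Here:

input_ids: the ids of the sentence's tokens

label_ids: the id(s) of the label for this sample; this differs by task type, e.g. cls and tagging tasks differ

length: the sentence length after adding the special tokens. Note that PAD is not counted but truncation is applied, so it is the number of meaningful tokens in the sentence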
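
The relationship between input_ids and length is easiest to see on a small example. The helper below is a standalone sketch of the same pad-or-truncate rule; the name pad_or_truncate and the sample ids (101 and 102 standing in for [CLS] and [SEP]) are made up for illustration.

```python
def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Return (input_ids, length): ids fixed to max_seq_len, length excluding padding."""
    length = min(len(token_ids), max_seq_len)
    input_ids = token_ids[:max_seq_len] + [pad_id] * (max_seq_len - length)
    return input_ids, length


# Short sentence: padded with 0, length counts only the real tokens.
print(pad_or_truncate([101, 7, 8, 102], max_seq_len=6))
# Long sentence: cut to max_seq_len, length equals max_seq_len.
print(pad_or_truncate([101, 1, 2, 3, 4, 102], max_seq_len=4))
```

Note that in the second case the trailing 102 is cut off: truncating after the special tokens are added, as convert_single_example does, drops the final [SEP] for over-long sentences.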

The data is then wrapped into a feature with DataFeature, which looks like this:

class DataFeature:
    """
    Wraps the data we need into a feature
    """
    def __init__(self, input_ids, label_ids, length):
        self.input_ids = input_ids
        self.label_ids = label_ids
        self.length = length

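That covers the whole flow of turning the training set into an input_fn. From front to back:

Read the training data
Convert the training data to serialized features and save them as a tf_record file
- each training example is converted to a feature with convert_single_example
Build train_input_fn from the tf_record file with file_base_input_fn_builder

The evaluation set can be handled the same way, but it is usually small, so to speed things up we generally don't save it as a tf_record file and instead keep it in memory: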

dev_input_fn=input_fn_builder(
    features=dp.convert_example_to_features(dev_data,label2id_map,tokenizer),
    is_training=False,
    max_seq_len=max_seq_len)

As you can see, the evaluation data is converted with convert_example_to_features:

    def convert_example_to_features(self, examples, label2id_map, tokenizer):
        # Convert data to features; used when the data is small, since the converted data is kept in memory
        features = []
        for idx, example in enumerate(examples):
            feature = self.convert_single_example(idx=idx, example=example, tokenizer=tokenizer, label_map=label2id_map)
            if feature is None:
                continue
            features.append(feature)
        return features

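This code looks somewhat simpler, but the principle is the same as for the training set.

Now both train_input_fn and dev_input_fn are built. But don't forget that the best-ckpt is saved during EVAL, and so far we have only passed exporters into EvalSpec, so next we need to define the exporters.

(5) best_ckpt_exporter

This exports the best model. The code is: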

    best_ckpt_exporter=BestCheckpointsExporter(
        serving_input_receiver_fn=serving_fn,
        best_checkpoint_path=best_ckpt_dir,
        compare_fn=loss_smaller
    )

This instantiates BestCheckpointsExporter, a custom export class that inherits from tf.estimator.BestExporter. Its parameters are:

serving_input_receiver_fn: a function-typed argument, defined as follows:

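def serving_fn():
    # Placeholders for the inputs the exported model expects at serving time
    input_ids=tf.placeholder(tf.int32,[None,None],name='input_ids')
    lengths=tf.placeholder(tf.int32,[None],name='length')

    # build_raw_serving_input_receiver_fn returns a function;
    # calling it yields the ServingInputReceiver
    receiver=tf.estimator.export.build_raw_serving_input_receiver_fn({
        'input_ids':input_ids,
        'length':lengths
    })()

    return receiver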

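It returns a ServingInputReceiver. serving_fn is presumably defined here to make deployment easier. In general, if you only need to save the ckpt, you can use tf.estimator.BestExporter.export directly and pass the corresponding arguments: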

def export(self, estimator, export_path, checkpoint_path, eval_result,is_the_final_export):

best_checkpoint_path: the path where the ckpt is saved
compare_fn: the comparison function, i.e. which metric decides that the current ckpt is better than the previous best-ckpt

def loss_smaller(best_eval_result, current_eval_result):
    default_key = "loss"
    if not best_eval_result or default_key not in best_eval_result:
        raise ValueError(
            'best_eval_result cannot be empty or no loss is found in it.')

    if not current_eval_result or default_key not in current_eval_result:
        raise ValueError(
            'current_eval_result cannot be empty or no loss is found in it.')

    return best_eval_result[default_key] > current_eval_result[default_key]

This implementation of loss_smaller is essentially the same as the one in the TensorFlow source.
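
compare_fn does not have to look at the loss. As a sketch, a comparison on another key of the eval result could follow the same pattern. The factory name metric_larger is made up here, and the 'acc' key assumes the metric was registered under that name in eval_metric_ops, as in the EVAL example earlier.

```python
def metric_larger(key):
    """Build a compare_fn that prefers a larger value of `key` (hypothetical helper)."""

    def compare_fn(best_eval_result, current_eval_result):
        if not best_eval_result or key not in best_eval_result:
            raise ValueError(
                'best_eval_result cannot be empty or no %s is found in it.' % key)

        if not current_eval_result or key not in current_eval_result:
            raise ValueError(
                'current_eval_result cannot be empty or no %s is found in it.' % key)

        # True means the current checkpoint is better than the best so far.
        return current_eval_result[key] > best_eval_result[key]

    return compare_fn


acc_larger = metric_larger('acc')
print(acc_larger({'acc': 0.90, 'loss': 0.30}, {'acc': 0.92, 'loss': 0.35}))
```

Here the current result wins on accuracy even though its loss is higher, which is exactly the case where the choice of compare_fn changes which checkpoint gets kept.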

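(6) Prediction

Finally, prediction. It takes the following steps:

Build an identical model
Load the best parameters into it
Predict

Because prediction should use the model with the best parameters, the estimator from training cannot be used as is; the model is rebuilt, initialized from the best ckpt, and then predict is called. (Estimator.predict also accepts a checkpoint_path argument, which may be another way to point at a specific checkpoint.)

That's the whole process. Once you are fluent with the code, estimator turns out not to be hard to use.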

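Summary

The above went from back to front; here is a short summary from front to back:

Read the training and evaluation sets
Convert them into train_input_fn and dev_input_fn
Build the estimator
Define best_ckpt_exporter
Generate train_spec and eval_spec from train_input_fn and dev_input_fn
Train with train_and_evaluate
Predict

That is the whole process of training with estimator; with a little care it is really not that hard.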