diff --git a/tfx_labs/Lab_5_TensorFlow_Transform.ipynb b/tfx_labs/Lab_5_TensorFlow_Transform.ipynb index ee909e3a..f943d405 100644 --- a/tfx_labs/Lab_5_TensorFlow_Transform.ipynb +++ b/tfx_labs/Lab_5_TensorFlow_Transform.ipynb @@ -1 +1,856 @@ -{"cells":[{"cell_type":"markdown","metadata":{"colab_type":"text","id":"tghWegsjhpkt"},"source":["##### Copyright © 2019 The TensorFlow Authors."]},{"cell_type":"code","execution_count":0,"metadata":{"cellView":"form","colab":{},"colab_type":"code","id":"rSGJWC5biBiG"},"outputs":[],"source":"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# https://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License."},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"mPt5BHTwy_0F"},"source":["# Preprocessing data with TensorFlow Transform\n","***The Feature Engineering Component of TensorFlow Extended (TFX)***\n","\n","This example colab notebook provides a somewhat more advanced example of how TensorFlow Transform (`tf.Transform`) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.\n","\n","TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. 
For example, using TensorFlow Transform you could:\n","\n","* Normalize an input value by using the mean and standard deviation\n","* Convert strings to integers by generating a vocabulary over all of the input values\n","* Convert floats to integers by assigning them to buckets, based on the observed data distribution\n","\n","TensorFlow has built-in support for manipulations on a single example or a batch of examples. `tf.Transform` extends these capabilities to support full passes over the entire training dataset.\n","\n","The output of `tf.Transform` is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.\n","\n","Key Point: In order to understand `tf.Transform` and how it works with Apache Beam, you'll need to know a little bit about Apache Beam itself. The Beam Programming Guide is a great place to start."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"_tQUubddMvnP"},"source":"## What we're doing in this example\n\nIn this example we'll be processing a widely used dataset containing census data, and training a model to do classification. Along the way we'll be transforming the data using `tf.Transform`.\n\nKey Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve or will it introduce bias? 
For more information, read about ML fairness.\n\nNote: TensorFlow Model Analysis is a powerful tool for understanding how well your model predicts for various segments of your data, including understanding how your model may reinforce societal biases and disparities."},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"RptgLn2RYuK3"},"source":"## Setup\nFirst, we install the necessary packages, download data, import modules and set up paths.\n\n### Install TensorFlow, TensorFlow Transform, and Apache Beam\n\n> #### Note\n> Because of some of the updates to packages you must use the button at the bottom of the output of this cell to restart the runtime. Following restart, you should rerun this cell."},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"JZi8MTlV8qTM"},"outputs":[],"source":"!pip install -U tensorflow==1.15\n!pip install -U tensorflow_transform\n!pip install -U apache_beam\n!pip install -q -U pyarrow==0.14.1"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"OT4zPmr9B8eH"},"source":["### Import packages\n","We import necessary packages, including Beam."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"K4QXVIM7iglN"},"outputs":[],"source":"import argparse\nimport logging\nimport os\nimport pprint\nimport tempfile\nimport urllib\nimport zipfile\n\nimport tensorflow.compat.v2 as tf\ntf.enable_v2_behavior()\ntf.get_logger().setLevel(logging.ERROR)\ntf.get_logger().propagate = False\n\nimport tensorflow_transform as tft\nimport tensorflow_transform.beam\nimport apache_beam as beam\n\nfrom tensorflow_transform.tf_metadata import dataset_metadata\nfrom tensorflow_transform.tf_metadata import dataset_schema"},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"qq3TD5WL8_oE"},"outputs":[],"source":"print('TensorFlow version: {}'.format(tf.__version__))\nprint('TensorFlow Transform version: 
{}'.format(tft.__version__))\nprint('Apache Beam version: {}'.format(beam.__version__))"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"eTp9YWvm9hLw"},"source":["Download and split the census data:"]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"Q5Wh3Zi681U2"},"outputs":[],"source":"temp = tempfile.gettempdir()\nzip_, headers = urllib.request.urlretrieve('https://storage.googleapis.com/tfx-colab-datasets/census.zip')\nzipfile.ZipFile(zip_).extractall(temp)\nzipfile.ZipFile(zip_).close()\n\ntrain = os.path.join(temp, 'census/adult.data')\ntest = os.path.join(temp, 'census/adult.test')"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"CxOxaaOYRfl7"},"source":["### Name our columns\n","We'll create some handy lists for referencing the columns in our dataset."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"-bsr1nLHqyg_"},"outputs":[],"source":"CATEGORICAL_FEATURE_KEYS = [\n 'workclass',\n 'education',\n 'marital-status',\n 'occupation',\n 'relationship',\n 'race',\n 'sex',\n 'native-country',\n]\nNUMERIC_FEATURE_KEYS = [\n 'age',\n 'capital-gain',\n 'capital-loss',\n 'hours-per-week',\n]\nOPTIONAL_NUMERIC_FEATURE_KEYS = [\n 'education-num',\n]\nLABEL_KEY = 'label'"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"qtTn4at8rurk"},"source":["###Define our features and schema\n","Let's define a schema based on what types the columns are in our input. 
Among other things this will help with importing them correctly."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"5oS2RfyCrzMr"},"outputs":[],"source":"RAW_DATA_FEATURE_SPEC = dict(\n [(name, tf.io.FixedLenFeature([], tf.string))\n for name in CATEGORICAL_FEATURE_KEYS] +\n [(name, tf.io.FixedLenFeature([], tf.float32))\n for name in NUMERIC_FEATURE_KEYS] +\n [(name, tf.io.VarLenFeature(tf.float32))\n for name in OPTIONAL_NUMERIC_FEATURE_KEYS] +\n [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]\n)\n\nRAW_DATA_METADATA = dataset_metadata.DatasetMetadata(\n dataset_schema.from_feature_spec(RAW_DATA_FEATURE_SPEC))"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"zdXy9lo4t45d"},"source":["###Setting hyperparameters and basic housekeeping\n","Constants and hyperparameters used for training. The bucket size includes all listed categories in the dataset description as well as one extra for \"?\" which represents unknown.\n","\n","Note: The number of instances will be computed by `tf.Transform` in future versions, in which case it can be read from the metadata. 
Similarly BUCKET_SIZES will not be needed as this information will be stored in the metadata for each of the columns."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"8WHyOkC9uL71"},"outputs":[],"source":"testing = os.getenv(\"WEB_TEST_BROWSER\", False)\nif testing:\n TRAIN_NUM_EPOCHS = 1\n NUM_TRAIN_INSTANCES = 1\n TRAIN_BATCH_SIZE = 1\n NUM_TEST_INSTANCES = 1\nelse:\n TRAIN_NUM_EPOCHS = 16\n NUM_TRAIN_INSTANCES = 32561\n TRAIN_BATCH_SIZE = 128\n NUM_TEST_INSTANCES = 16281\n\n# Names of temp files\nTRANSFORMED_TRAIN_DATA_FILEBASE = 'train_transformed'\nTRANSFORMED_TEST_DATA_FILEBASE = 'test_transformed'\nEXPORTED_MODEL_DIR = 'exported_model_dir'"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"9ev2P3cHcrM9"},"source":["## Cleaning\n","\n","### Create a Beam Transform for cleaning our input data\n","\n","We'll create a **Beam Transform** by creating a subclass of Apache Beam's `PTransform` class and overriding the `expand` method to specify the actual processing logic. A `PTransform` represents a data processing operation, or a step, in your pipeline. Every `PTransform` takes one or more `PCollection` objects as input, performs a processing function that you provide on the elements of that `PCollection`, and produces zero or more output PCollection objects.\n","\n","Our transform class will apply Beam's `ParDo` on the input `PCollection` containing our census dataset, producing clean data in an output `PCollection`.\n","\n","Key Point: The `expand` method of a `PTransform` is not meant to be invoked directly by the user of a transform. Instead, you should call the `apply` method on the `PCollection` itself, with the transform as an argument. 
This allows transforms to be nested within the structure of your pipeline."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"uCpstiSCessP"},"outputs":[],"source":"class MapAndFilterErrors(beam.PTransform):\n \"\"\"Like beam.Map but filters out errors in the map_fn.\"\"\"\n\n class _MapAndFilterErrorsDoFn(beam.DoFn):\n \"\"\"Count the bad examples using a beam metric.\"\"\"\n\n def __init__(self, fn):\n self._fn = fn\n # Create a counter to measure number of bad elements.\n self._bad_elements_counter = beam.metrics.Metrics.counter(\n 'census_example', 'bad_elements')\n\n def process(self, element):\n try:\n yield self._fn(element)\n except Exception: # pylint: disable=broad-except\n # Catch any exception raised by the above call.\n self._bad_elements_counter.inc(1)\n\n def __init__(self, fn):\n self._fn = fn\n\n def expand(self, pcoll):\n return pcoll | beam.ParDo(self._MapAndFilterErrorsDoFn(self._fn))"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"0a1ns5KswDb2"},"source":["## Preprocessing with `tf.Transform`\n","\n","### Create a `tf.Transform` preprocessing_fn\n","\n","The _preprocessing function_ is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a [`Tensor`](https://www.tensorflow.org/api_docs/python/tf/Tensor) or [`SparseTensor`](https://www.tensorflow.org/api_docs/python/tf/SparseTensor). There are two main groups of API calls that typically form the heart of a preprocessing function:\n","\n","1. **TensorFlow Ops:** Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data one feature vector at a time. These will run for every example, during both training and serving.\n","2. 
**TensorFlow Transform Analyzers:** Any of the analyzers provided by tf.Transform. Analyzers also accept and return tensors, but unlike TensorFlow ops they only run once, during training, and typically make a full pass over the entire training dataset. They create [tensor constants](https://www.tensorflow.org/api_docs/python/tf/constant), which are added to your graph. For example, `tft.min` computes the minimum of a tensor over the training dataset. tf.Transform provides a fixed set of analyzers, but this will be extended in future versions.\n","\n","Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change. If your data has trend or seasonality components, plan accordingly."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"MWzml7Dl04XQ"},"outputs":[],"source":"def preprocessing_fn(inputs):\n \"\"\"Preprocess input columns into transformed columns.\"\"\"\n # Since we are modifying some features and leaving others unchanged, we\n # start by setting `outputs` to a copy of `inputs`.\n outputs = inputs.copy()\n\n # Scale numeric columns to have range [0, 1].\n for key in NUMERIC_FEATURE_KEYS:\n outputs[key] = tft.scale_to_0_1(outputs[key])\n\n for key in OPTIONAL_NUMERIC_FEATURE_KEYS:\n # This is a SparseTensor because it is optional. Here we fill in a default\n # value when it is missing.\n dense = tf.sparse.to_dense(outputs[key], default_value=0.)\n # Reshaping from a batch of vectors of size 1 to a batch of scalars.\n dense = tf.squeeze(dense, axis=1)\n outputs[key] = tft.scale_to_0_1(dense)\n\n # For all categorical columns except the label column, we generate a\n # vocabulary but do not modify the feature. 
This vocabulary is instead\n # used in the trainer, by means of a feature column, to convert the feature\n # from a string to an integer id.\n for key in CATEGORICAL_FEATURE_KEYS:\n tft.vocabulary(inputs[key], vocab_filename=key)\n\n # For the label column we provide the mapping from string to index.\n initializer = tf.lookup.KeyValueTensorInitializer(\n keys=['>50K', '<=50K'],\n values=tf.cast(tf.range(2), tf.int64),\n key_dtype=tf.string,\n value_dtype=tf.int64)\n table = tf.lookup.StaticHashTable(initializer, default_value=-1)\n\n outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])\n\n return outputs\n"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"rgAGOAdFWRn2"},"source":["###Transform the data\n","Now we're ready to start transforming our data in an Apache Beam pipeline.\n","\n","1. Read in the data using the CSV reader\n","1. Clean it using our new `MapAndFilterErrors` transform\n","1. Transform it using a preprocessing pipeline that scales numeric data and converts categorical data from strings to int64 value indices, by creating a vocabulary for each category\n","1. 
Write out the result as a `TFRecord` of `Example` protos, which we will use for training a model later\n",""]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"BQAPzdCRwi5d"},"outputs":[],"source":"def transform_data(train_data_file, test_data_file, working_dir):\n \"\"\"Transform the data and write out as a TFRecord of Example protos.\n\n Read in the data using the CSV reader, and transform it using a\n preprocessing pipeline that scales numeric data and converts categorical data\n from strings to int64 value indices, by creating a vocabulary for each\n category.\n\n Args:\n train_data_file: File containing training data\n test_data_file: File containing test data\n working_dir: Directory to write transformed data and metadata to\n \"\"\"\n\n # The \"with\" block will create a pipeline, and run that pipeline at the exit\n # of the block.\n with beam.Pipeline() as pipeline:\n with tft.beam.Context(temp_dir=tempfile.mkdtemp()):\n # Create a coder to read the census data with the schema. To do this we\n # need to list all columns in order since the schema doesn't specify the\n # order of columns in the csv.\n ordered_columns = [\n 'age', 'workclass', 'fnlwgt', 'education', 'education-num',\n 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',\n 'label'\n ]\n converter = tft.coders.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)\n\n # Read in raw data and convert using CSV converter. Note that we apply\n # some Beam transformations here, which will not be encoded in the TF\n # graph since we don't do them from within tf.Transform's methods\n # (AnalyzeDataset, TransformDataset etc.). 
These transformations are just\n # to get data into a format that the CSV converter can read, in particular\n # removing spaces after commas.\n #\n # We use MapAndFilterErrors instead of Map to filter out decode errors in\n # converter.decode, which should only occur for the trailing blank line.\n raw_data = (\n pipeline\n | 'ReadTrainData' >> beam.io.ReadFromText(train_data_file)\n | 'FixCommasTrainData' >> beam.Map(\n lambda line: line.replace(', ', ','))\n | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))\n\n # Combine data and schema into a dataset tuple. Note that we already used\n # the schema to read the CSV data, but we also need it to interpret\n # raw_data.\n raw_dataset = (raw_data, RAW_DATA_METADATA)\n transformed_dataset, transform_fn = (\n raw_dataset | tft.beam.AnalyzeAndTransformDataset(preprocessing_fn))\n transformed_data, transformed_metadata = transformed_dataset\n transformed_data_coder = tft.coders.ExampleProtoCoder(\n transformed_metadata.schema)\n\n _ = (\n transformed_data\n | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)\n | 'WriteTrainData' >> beam.io.WriteToTFRecord(\n os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))\n\n # Now apply transform function to test data. 
In this case we remove the\n # trailing period at the end of each line, and also ignore the header line\n # that is present in the test data file.\n raw_test_data = (\n pipeline\n | 'ReadTestData' >> beam.io.ReadFromText(test_data_file,\n skip_header_lines=1)\n | 'FixCommasTestData' >> beam.Map(\n lambda line: line.replace(', ', ','))\n | 'RemoveTrailingPeriodsTestData' >> beam.Map(lambda line: line[:-1])\n | 'DecodeTestData' >> MapAndFilterErrors(converter.decode))\n\n raw_test_dataset = (raw_test_data, RAW_DATA_METADATA)\n\n transformed_test_dataset = (\n (raw_test_dataset, transform_fn) | tft.beam.TransformDataset())\n # Don't need transformed data schema, it's the same as before.\n transformed_test_data, _ = transformed_test_dataset\n\n _ = (\n transformed_test_data\n | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)\n | 'WriteTestData' >> beam.io.WriteToTFRecord(\n os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))\n\n # Will write a SavedModel and metadata to working_dir, which can then\n # be read by the tft.TFTransformOutput class.\n _ = (\n transform_fn\n | 'WriteTransformFn' >> tft.beam.WriteTransformFn(working_dir))"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"TnaMyRMJ03bR"},"source":["##Using our preprocessed data to train a model\n","\n","To show how `tf.Transform` enables us to use the same code for both training and serving, and thus prevent skew, we're going to train a model. To train our model and prepare our trained model for production we need to create input functions. The main difference between our training input function and our serving input function is that training data contains the labels, and production data does not. 
The arguments and return values are also somewhat different."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"M8xCZKNc2wAS"},"source":["###Create an input function for training"]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"kFO0MeWQ228a"},"outputs":[],"source":"def _make_training_input_fn(tf_transform_output, transformed_examples,\n batch_size):\n \"\"\"Creates an input function reading from transformed data.\n\n Args:\n tf_transform_output: Wrapper around output of tf.Transform.\n transformed_examples: Base filename of examples.\n batch_size: Batch size.\n\n Returns:\n The input function for training or eval.\n \"\"\"\n def input_fn():\n \"\"\"Input function for training and eval.\"\"\"\n dataset = tf.data.experimental.make_batched_features_dataset(\n file_pattern=transformed_examples,\n batch_size=batch_size,\n features=tf_transform_output.transformed_feature_spec(),\n reader=tf.data.TFRecordDataset,\n shuffle=True)\n\n transformed_features = dataset.make_one_shot_iterator().get_next()\n\n # Extract features and label from the transformed tensors.\n transformed_labels = transformed_features.pop(LABEL_KEY)\n\n return transformed_features, transformed_labels\n\n return input_fn"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"nYeuthrs27vl"},"source":["###Create an input function for serving\n","\n","Let's create an input function that we could use in production, and prepare our trained model for serving."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"NN5FVg343Jea"},"outputs":[],"source":"def _make_serving_input_fn(tf_transform_output):\n \"\"\"Creates an input function reading from raw data.\n\n Args:\n tf_transform_output: Wrapper around output of tf.Transform.\n\n Returns:\n The serving input function.\n \"\"\"\n raw_feature_spec = RAW_DATA_FEATURE_SPEC.copy()\n # Remove label since it is not available 
during serving.\n raw_feature_spec.pop(LABEL_KEY)\n\n def serving_input_fn():\n \"\"\"Input function for serving.\"\"\"\n # Get raw features by generating the basic serving input_fn and calling it.\n # Here we generate an input_fn that expects a parsed Example proto to be fed\n # to the model at serving time. See also\n # tf.estimator.export.build_raw_serving_input_receiver_fn.\n raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n raw_feature_spec, default_batch_size=None)\n serving_input_receiver = raw_input_fn()\n\n # Apply the transform function that was used to generate the materialized\n # data.\n raw_features = serving_input_receiver.features\n transformed_features = tf_transform_output.transform_raw_features(\n raw_features)\n\n return tf.estimator.export.ServingInputReceiver(\n transformed_features, serving_input_receiver.receiver_tensors)\n\n return serving_input_fn"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"Vc9Edp8A7dsI"},"source":["###Wrap our input data in FeatureColumns\n","Our model will expect our data in TensorFlow FeatureColumns."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"6qOFOvBk7oJX"},"outputs":[],"source":"def get_feature_columns(tf_transform_output):\n \"\"\"Returns the FeatureColumns for the model.\n\n Args:\n tf_transform_output: A `TFTransformOutput` object.\n\n Returns:\n A list of FeatureColumns.\n \"\"\"\n # Wrap scalars as real valued columns.\n real_valued_columns = [tf.feature_column.numeric_column(key, shape=())\n for key in NUMERIC_FEATURE_KEYS]\n\n # Wrap categorical columns.\n one_hot_columns = [\n tf.feature_column.categorical_column_with_vocabulary_file(\n key=key,\n vocabulary_file=tf_transform_output.vocabulary_file_by_name(\n vocab_filename=key))\n for key in CATEGORICAL_FEATURE_KEYS]\n\n return real_valued_columns + one_hot_columns"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"LyNTX7CO8AAz"},"source":["##Train, 
Evaluate, and Export our model"]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"8iGQ0jeq8IWr"},"outputs":[],"source":"def train_and_evaluate(working_dir, num_train_instances=NUM_TRAIN_INSTANCES,\n num_test_instances=NUM_TEST_INSTANCES):\n \"\"\"Train the model on training data and evaluate on test data.\n\n Args:\n working_dir: Directory to read transformed data and metadata from and to\n write exported model to.\n num_train_instances: Number of instances in train set\n num_test_instances: Number of instances in test set\n\n Returns:\n The results from the estimator's 'evaluate' method\n \"\"\"\n tf_transform_output = tft.TFTransformOutput(working_dir)\n run_config = tf.estimator.RunConfig()\n\n estimator = tf.estimator.LinearClassifier(\n feature_columns=get_feature_columns(tf_transform_output),\n config=run_config)\n\n # Fit the model using the default optimizer.\n train_input_fn = _make_training_input_fn(\n tf_transform_output,\n os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE + '*'),\n batch_size=TRAIN_BATCH_SIZE)\n estimator.train(\n input_fn=train_input_fn,\n max_steps=TRAIN_NUM_EPOCHS * num_train_instances / TRAIN_BATCH_SIZE)\n\n # Evaluate model on test dataset.\n eval_input_fn = _make_training_input_fn(\n tf_transform_output,\n os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE + '*'),\n batch_size=1)\n\n # Export the model.\n serving_input_fn = _make_serving_input_fn(tf_transform_output)\n exported_model_dir = os.path.join(working_dir, EXPORTED_MODEL_DIR)\n estimator.export_saved_model(exported_model_dir, serving_input_fn)\n\n return estimator.evaluate(input_fn=eval_input_fn, steps=num_test_instances)"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"JrFJ9Zax8QAh"},"source":["## Put it all together\n","\n","We've created all the stuff we need to preprocess our census data, train a model, and prepare it for serving. So far we've just been getting things ready. 
It's time to start running!\n","\n",">### Note:\n","> Scroll the output from this cell to see the whole process. The results will be at the bottom."]},{"cell_type":"code","execution_count":0,"metadata":{"colab":{},"colab_type":"code","id":"xe7JdSin86Ez"},"outputs":[],"source":"import time\n\nstart = time.time()\n\ntransform_data(train, test, temp)\nprint('Transform took {:.2f} seconds'.format(time.time() - start))\nresults = train_and_evaluate(temp)\nprint('Transform and training took {:.2f} seconds'.format(time.time() - start))\npprint.pprint(results)"},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"ICqetCnSjwp1"},"source":["##What we did\n","In this example we used `tf.Transform` to preprocess a dataset of census data, and train a model with the cleaned and transformed data. We also created an input function that we could use when we deploy our trained model in a production environment to perform inference. By using the same code for both training and inference we avoid any issues with data skew. Along the way we learned about creating an Apache Beam transform to perform the transformation that we needed for cleaning the data, and wrapped our data in TensorFlow `FeatureColumns`. This is just a small piece of what TensorFlow Transform can do! 
We encourage you to dive into `tf.Transform` and discover what it can do for you."]}],"nbformat":4,"nbformat_minor":2,"metadata":{"language_info":{"name":"python","codemirror_mode":{"name":"ipython","version":3}},"orig_nbformat":2,"file_extension":".py","mimetype":"text/x-python","name":"python","npconvert_exporter":"python","pygments_lexer":"ipython3","version":3}} \ No newline at end of file +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "TFX Lab 5 – Preprocessing data with TensorFlow Transform", + "provenance": [], + "private_outputs": true, + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tghWegsjhpkt" + }, + "source": [ + "##### Copyright © 2019 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "metadata": { + "cellView": "form", + "colab_type": "code", + "id": "rSGJWC5biBiG", + "colab": {} + }, + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "mPt5BHTwy_0F" + }, + "source": [ + "# Preprocessing data with TensorFlow Transform\n", + "***The Feature Engineering Component of TensorFlow Extended (TFX)***\n", + "\n", + "This example colab notebook provides a somewhat more advanced example of how TensorFlow Transform (`tf.Transform`) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.\n", + "\n", + "TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you could:\n", + "\n", + "* Normalize an input value by using the mean and standard deviation\n", + "* Convert strings to integers by generating a vocabulary over all of the input values\n", + "* Convert floats to integers by assigning them to buckets, based on the observed data distribution\n", + "\n", + "TensorFlow has built-in support for manipulations on a single example or a batch of examples. `tf.Transform` extends these capabilities to support full passes over the entire training dataset.\n", + "\n", + "The output of `tf.Transform` is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.\n", + "\n", + "Key Point: In order to understand `tf.Transform` and how it works with Apache Beam, you'll need to know a little bit about Apache Beam itself. The Beam Programming Guide is a great place to start." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_tQUubddMvnP" + }, + "source": [ + "## What we're doing in this example\n", + "\n", + "In this example we'll be processing a widely used dataset containing census data, and training a model to do classification. Along the way we'll be transforming the data using `tf.Transform`.\n", + "\n", + "Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve or will it introduce bias? For more information, read about ML fairness.\n", + "\n", + "Note: TensorFlow Model Analysis is a powerful tool for understanding how well your model predicts for various segments of your data, including understanding how your model may reinforce societal biases and disparities." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "RptgLn2RYuK3" + }, + "source": [ + "## Setup\n", + "First, we install the necessary packages, download data, import modules and set up paths.\n", + "\n", + "### Install TensorFlow, TensorFlow Transform, and Apache Beam\n", + "\n", + "> #### Note\n", + "> Because of some of the updates to packages you must use the menu option (Runtime / Restart Runtime) to restart the runtime. Following restart, you should rerun this cell." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JZi8MTlV8qTM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "!pip install -q -U tensorflow==2.0.0 tensorflow_transform==0.15.0 apache_beam pyarrow==0.14.1\n", + "!pip freeze | grep -e tensorflow -e pyarrow -e apache_beam" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7_IV9SuOsteI", + "colab_type": "text" + }, + "source": [ + "It is necessary to restart the runtime (by clicking the Runtime / Restart Runtime... button) so that the new packages will take effect." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OT4zPmr9B8eH", + "colab_type": "text" + }, + "source": [ + "### Import packages\n", + "We import necessary packages, including Beam." + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "K4QXVIM7iglN", + "colab": {} + }, + "source": [ + "import argparse\n", + "import logging\n", + "import os\n", + "import pprint\n", + "import tempfile\n", + "import urllib\n", + "import zipfile\n", + "\n", + "import tensorflow as tf\n", + "tf.get_logger().setLevel(logging.ERROR)\n", + "tf.get_logger().propagate = False\n", + "\n", + "import tensorflow_transform as tft\n", + "import tensorflow_transform.beam\n", + "import apache_beam as beam\n", + "\n", + "from tensorflow_transform.tf_metadata import dataset_metadata\n", + "from tensorflow_transform.tf_metadata import dataset_schema" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qq3TD5WL8_oE", + "colab_type": "code", + "colab": {} + }, + "source": [ + "print('TensorFlow version: {}'.format(tf.__version__))\n", + "print('TensorFlow Transform version: {}'.format(tft.__version__))\n", + "print('Apache Beam version: {}'.format(beam.__version__))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eTp9YWvm9hLw", + "colab_type": "text" + }, + "source": [ 
"Download and split the census data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q5Wh3Zi681U2", + "colab_type": "code", + "colab": {} + }, + "source": [ + "temp = tempfile.mkdtemp()\n", + "zip_, headers = urllib.request.urlretrieve('https://storage.googleapis.com/tfx-colab-datasets/census.zip')\n", + "# Use a context manager so the archive is closed after extraction.\n", + "with zipfile.ZipFile(zip_) as z:\n", + "  z.extractall(temp)\n", + "\n", + "train = os.path.join(temp, 'census/adult.data')\n", + "test = os.path.join(temp, 'census/adult.test')\n", + "print('Temp directory used is {}'.format(temp))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "CxOxaaOYRfl7" + }, + "source": [ + "### Name our columns\n", + "We'll create some handy lists for referencing the columns in our dataset." + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "-bsr1nLHqyg_", + "colab": {} + }, + "source": [ + "CATEGORICAL_FEATURE_KEYS = [\n", + " 'workclass',\n", + " 'education',\n", + " 'marital-status',\n", + " 'occupation',\n", + " 'relationship',\n", + " 'race',\n", + " 'sex',\n", + " 'native-country',\n", + "]\n", + "NUMERIC_FEATURE_KEYS = [\n", + " 'age',\n", + " 'capital-gain',\n", + " 'capital-loss',\n", + " 'hours-per-week',\n", + "]\n", + "OPTIONAL_NUMERIC_FEATURE_KEYS = [\n", + " 'education-num',\n", + "]\n", + "LABEL_KEY = 'label'" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "qtTn4at8rurk" + }, + "source": [ + "### Define our features and schema\n", + "Let's define a schema based on the types of the columns in our input. Among other things this will help with importing them correctly."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "5oS2RfyCrzMr", + "colab": {} + }, + "source": [ + "RAW_DATA_FEATURE_SPEC = dict(\n", + " [(name, tf.io.FixedLenFeature([], tf.string))\n", + " for name in CATEGORICAL_FEATURE_KEYS] +\n", + " [(name, tf.io.FixedLenFeature([], tf.float32))\n", + " for name in NUMERIC_FEATURE_KEYS] +\n", + " [(name, tf.io.VarLenFeature(tf.float32))\n", + " for name in OPTIONAL_NUMERIC_FEATURE_KEYS] +\n", + " [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]\n", + ")\n", + "\n", + "RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(\n", + " dataset_schema.from_feature_spec(RAW_DATA_FEATURE_SPEC))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "zdXy9lo4t45d" + }, + "source": [ + "### Setting hyperparameters and basic housekeeping\n", + "Constants and hyperparameters used for training.\n", + "\n", + "Note: The number of instances will be computed by `tf.Transform` in future versions, in which case it can be read from the metadata." + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "8WHyOkC9uL71", + "colab": {} + }, + "source": [ + "TRAIN_NUM_EPOCHS = 16\n", + "NUM_TRAIN_INSTANCES = 32561\n", + "TRAIN_BATCH_SIZE = 128\n", + "NUM_TEST_INSTANCES = 16281\n", + "\n", + "# Names of temp files\n", + "TRANSFORMED_TRAIN_DATA_FILEBASE = 'train_transformed'\n", + "TRANSFORMED_TEST_DATA_FILEBASE = 'test_transformed'\n", + "EXPORTED_MODEL_DIR = 'exported_model_dir'" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "9ev2P3cHcrM9" + }, + "source": [ + "## Cleaning\n", + "\n", + "### Create a Beam PTransform for cleaning our input data\n", + "\n", + "We'll create a **Beam PTransform** by creating a subclass of Apache Beam's `PTransform` class and overriding the `expand` method to specify the actual processing logic. 
A `PTransform` represents a data processing operation, or a step, in your pipeline. Every `PTransform` takes one or more `PCollection` objects as input, performs a processing function that you provide on the elements of that `PCollection`, and produces zero or more output `PCollection` objects.\n", + "\n", + "Our transform class will apply Beam's `ParDo` on the input `PCollection` containing our census dataset, producing clean data in an output `PCollection`.\n", + "\n", + "Key Point: The `expand` method of a `PTransform` is not meant to be invoked directly by the user of a transform. Instead, you should call the logical or operator (`|`) on the `PCollection` itself, with the transform as an argument. This allows transforms to be nested within the structure of your pipeline." + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "uCpstiSCessP", + "colab": {} + }, + "source": [ + "class MapAndFilterErrors(beam.PTransform):\n", + " \"\"\"Like beam.Map but filters out errors in the map_fn.\"\"\"\n", + "\n", + " class _MapAndFilterErrorsDoFn(beam.DoFn):\n", + " \"\"\"Count the bad examples using a beam metric.\"\"\"\n", + "\n", + " def __init__(self, fn):\n", + " self._fn = fn\n", + " # Create a counter to measure number of bad elements.\n", + " self._bad_elements_counter = beam.metrics.Metrics.counter(\n", + " 'census_example', 'bad_elements')\n", + "\n", + " def process(self, element):\n", + " try:\n", + " yield self._fn(element)\n", + " except Exception: # pylint: disable=broad-except\n", + " # Catch any exception from the above call.\n", + " self._bad_elements_counter.inc(1)\n", + "\n", + " def __init__(self, fn):\n", + " self._fn = fn\n", + "\n", + " def expand(self, pcoll):\n", + " return pcoll | beam.ParDo(self._MapAndFilterErrorsDoFn(self._fn))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0a1ns5KswDb2" + }, + "source": [ + "## Preprocessing with 
`tf.Transform`\n", + "\n", + "### Create a `tf.Transform` preprocessing_fn\n", + "\n", + "The _preprocessing function_ is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a [`Tensor`](https://www.tensorflow.org/api_docs/python/tf/Tensor) or [`SparseTensor`](https://www.tensorflow.org/api_docs/python/tf/SparseTensor). There are two main groups of API calls that typically form the heart of a preprocessing function:\n", + "\n", + "1. **TensorFlow Ops:** Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data one feature vector at a time. These will run for every example, during both training and serving.\n", + "2. **TensorFlow Transform Analyzers:** Any of the analyzers provided by tf.Transform. Analyzers also accept and return tensors, but unlike TensorFlow ops they only run once, during analysis, and typically make a full pass over the entire training dataset. They create [tensor constants](https://www.tensorflow.org/api_docs/python/tf/constant), which are added to your graph. For example, `tft.min` computes the minimum of a tensor over the training dataset. tf.Transform provides a fixed set of analyzers, but this will be extended in future versions.\n", + "\n", + "It is worth noting that each Tensor/SparseTensor is a (mini)batch of values.\n", + "\n", + "Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during analysis do not change. If your data has trend or seasonality components, plan accordingly." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MWzml7Dl04XQ", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def preprocessing_fn(inputs):\n", + " \"\"\"Preprocess input columns into transformed columns.\"\"\"\n", + " # Since we are modifying some features and leaving others unchanged, we\n", + " # start by setting `outputs` to a copy of `inputs`.\n", + " outputs = inputs.copy()\n", + "\n", + " # Scale numeric columns to have range [0, 1].\n", + " for key in NUMERIC_FEATURE_KEYS:\n", + " outputs[key] = tft.scale_to_0_1(outputs[key])\n", + "\n", + " for key in OPTIONAL_NUMERIC_FEATURE_KEYS:\n", + " # This is a SparseTensor because it is optional. Here we fill in a default\n", + " # value when it is missing.\n", + " dense = tf.sparse.to_dense(outputs[key], default_value=0.)\n", + " # Reshaping from a batch of vectors of size 1 to a batch of scalars.\n", + " dense = tf.squeeze(dense, axis=1)\n", + " outputs[key] = tft.scale_to_0_1(dense)\n", + "\n", + " # For all categorical columns except the label column, we generate a\n", + " # vocabulary but do not modify the feature. 
This vocabulary is instead\n", + " # used in the trainer, by means of a feature column, to convert the feature\n", + " # from a string to an integer id.\n", + " for key in CATEGORICAL_FEATURE_KEYS:\n", + " tft.vocabulary(inputs[key], vocab_filename=key)\n", + "\n", + " # For the label column we provide the mapping from string to index.\n", + " initializer = tf.lookup.KeyValueTensorInitializer(\n", + " keys=['>50K', '<=50K'],\n", + " values=tf.cast(tf.range(2), tf.int64),\n", + " key_dtype=tf.string,\n", + " value_dtype=tf.int64)\n", + " table = tf.lookup.StaticHashTable(initializer, default_value=-1)\n", + "\n", + " outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])\n", + "\n", + " return outputs\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "rgAGOAdFWRn2" + }, + "source": [ + "### Transform the data\n", + "Now we're ready to start transforming our data in an Apache Beam pipeline.\n", + "\n", + "1. Read in the data using the CSV reader\n", + "1. Clean it using our new `MapAndFilterErrors` transform\n", + "1. Transform it using a preprocessing pipeline that scales numeric data and converts categorical data from strings to int64 indices by creating a vocabulary for each category\n", + "1. 
Write out the result as a `TFRecord` of `Example` protos, which we will use for training a model later\n", + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "BQAPzdCRwi5d", + "colab": {} + }, + "source": [ + "def transform_data(train_data_file, test_data_file, working_dir):\n", + " \"\"\"Transform the data and write out as a TFRecord of Example protos.\n", + "\n", + " Read in the data using the CSV reader, and transform it using a\n", + " preprocessing pipeline that scales numeric data and converts categorical data\n", + " from strings to int64 indices by creating a vocabulary for each\n", + " category.\n", + "\n", + " Args:\n", + " train_data_file: File containing training data\n", + " test_data_file: File containing test data\n", + " working_dir: Directory to write transformed data and metadata to\n", + " \"\"\"\n", + "\n", + " # The \"with\" block will create a pipeline, and run that pipeline at the exit\n", + " # of the block.\n", + " with beam.Pipeline() as pipeline:\n", + " with tft.beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " # Create a coder to read the census data with the schema. To do this we\n", + " # need to list all columns in order since the schema doesn't specify the\n", + " # order of columns in the csv.\n", + " ordered_columns = [\n", + " 'age', 'workclass', 'fnlwgt', 'education', 'education-num',\n", + " 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n", + " 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',\n", + " 'label'\n", + " ]\n", + " converter = tft.coders.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)\n", + "\n", + " # Read in raw data and convert using CSV converter. Note that we apply\n", + " # some Beam transformations here, which will not be encoded in the TF\n", + " # graph since we don't do them from within tf.Transform's methods\n", + " # (AnalyzeDataset, TransformDataset etc.). 
These transformations are just\n", + " # to get data into a format that the CSV converter can read, in particular\n", + " # removing spaces after commas.\n", + " #\n", + " # We use MapAndFilterErrors instead of Map to filter out decode errors in\n", + " # converter.decode which should only occur for the trailing blank line.\n", + " raw_data = (\n", + " pipeline\n", + " | 'ReadTrainData' >> beam.io.ReadFromText(train_data_file)\n", + " | \"FilterNonEmptyRawData\" >> beam.Filter(lambda x: x.strip()!=\"\")\n", + " | 'FixCommasTrainData' >> beam.Map(\n", + " lambda line: line.replace(', ', ','))\n", + " | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))\n", + "\n", + " # Combine data and schema into a dataset tuple. Note that we already used\n", + " # the schema to read the CSV data, but we also need it to interpret\n", + " # raw_data.\n", + " raw_dataset = (raw_data, RAW_DATA_METADATA)\n", + " transformed_dataset, transform_fn = (\n", + " raw_dataset | tft.beam.AnalyzeAndTransformDataset(preprocessing_fn))\n", + " transformed_data, transformed_metadata = transformed_dataset\n", + " transformed_data_coder = tft.coders.ExampleProtoCoder(\n", + " transformed_metadata.schema)\n", + "\n", + " _ = (\n", + " transformed_data\n", + " | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)\n", + " | 'WriteTrainData' >> beam.io.WriteToTFRecord(\n", + " os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))\n", + "\n", + " # Now apply transform function to test data. 
In this case we remove the\n", + " # trailing period at the end of each line, and also ignore the header line\n", + " # that is present in the test data file.\n", + " raw_test_data = (\n", + " pipeline\n", + " | 'ReadTestData' >> beam.io.ReadFromText(test_data_file,\n", + " skip_header_lines=1)\n", + " | \"FilterNonEmptyTestData\" >> beam.Filter(lambda x: x.strip()!=\"\")\n", + " | 'FixCommasTestData' >> beam.Map(\n", + " lambda line: line.replace(', ', ','))\n", + " | 'RemoveTrailingPeriodsTestData' >> beam.Map(lambda line: line[:-1])\n", + " | 'DecodeTestData' >> MapAndFilterErrors(converter.decode))\n", + "\n", + " raw_test_dataset = (raw_test_data, RAW_DATA_METADATA)\n", + "\n", + " transformed_test_dataset = (\n", + " (raw_test_dataset, transform_fn) | tft.beam.TransformDataset())\n", + " # Don't need transformed data schema, it's the same as before.\n", + " transformed_test_data, _ = transformed_test_dataset\n", + "\n", + " _ = (\n", + " transformed_test_data\n", + " | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)\n", + " | 'WriteTestData' >> beam.io.WriteToTFRecord(\n", + " os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))\n", + "\n", + " # Will write a SavedModel and metadata to working_dir, which can then\n", + " # be read by the tft.TFTransformOutput class.\n", + " _ = (\n", + " transform_fn\n", + " | 'WriteTransformFn' >> tft.beam.WriteTransformFn(working_dir))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "TnaMyRMJ03bR" + }, + "source": [ + "## Using our preprocessed data to train a model\n", + "\n", + "To show how `tf.Transform` enables us to use the same code for both training and serving, and thus prevent skew, we're going to train a model. To train our model and prepare our trained model for production we need to create input functions. 
The main difference between our training input function and our serving input function is that training data contains the labels, and production data does not. The arguments and returns are also somewhat different." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "M8xCZKNc2wAS" + }, + "source": [ + "### Create an input function for training" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "kFO0MeWQ228a", + "colab": {} + }, + "source": [ + "def _make_training_input_fn(tf_transform_output, transformed_examples,\n", + " batch_size):\n", + " \"\"\"Creates an input function reading from transformed data.\n", + "\n", + " Args:\n", + " tf_transform_output: Wrapper around output of tf.Transform.\n", + " transformed_examples: Base filename of examples.\n", + " batch_size: Batch size.\n", + "\n", + " Returns:\n", + " The input function for training or eval.\n", + " \"\"\"\n", + " def input_fn():\n", + " \"\"\"Input function for training and eval.\"\"\"\n", + " dataset = tf.data.experimental.make_batched_features_dataset(\n", + " file_pattern=transformed_examples,\n", + " batch_size=batch_size,\n", + " features=tf_transform_output.transformed_feature_spec(),\n", + " reader=tf.data.TFRecordDataset,\n", + " shuffle=True)\n", + "\n", + " # Dataset.make_one_shot_iterator was removed in TF 2.x; use the compat API.\n", + " transformed_features = tf.compat.v1.data.make_one_shot_iterator(\n", + " dataset).get_next()\n", + "\n", + " # Extract features and label from the transformed tensors.\n", + " transformed_labels = transformed_features.pop(LABEL_KEY)\n", + "\n", + " return transformed_features, transformed_labels\n", + "\n", + " return input_fn" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "nYeuthrs27vl" + }, + "source": [ + "### Create an input function for serving\n", + "\n", + "Let's create an input function that we could use in production, and prepare our trained model for serving." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "NN5FVg343Jea", + "colab": {} + }, + "source": [ + "def _make_serving_input_fn(tf_transform_output):\n", + " \"\"\"Creates an input function reading from raw data.\n", + "\n", + " Args:\n", + " tf_transform_output: Wrapper around output of tf.Transform.\n", + "\n", + " Returns:\n", + " The serving input function.\n", + " \"\"\"\n", + " raw_feature_spec = RAW_DATA_FEATURE_SPEC.copy()\n", + " # Remove label since it is not available during serving.\n", + " raw_feature_spec.pop(LABEL_KEY)\n", + "\n", + " def serving_input_fn():\n", + " \"\"\"Input function for serving.\"\"\"\n", + " # Get raw features by generating the basic serving input_fn and calling it.\n", + " # Here we generate an input_fn that expects a parsed Example proto to be fed\n", + " # to the model at serving time. See also\n", + " # tf.estimator.export.build_raw_serving_input_receiver_fn.\n", + " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n", + " raw_feature_spec, default_batch_size=None)\n", + " serving_input_receiver = raw_input_fn()\n", + "\n", + " # Apply the transform function that was used to generate the materialized\n", + " # data.\n", + " raw_features = serving_input_receiver.features\n", + " transformed_features = tf_transform_output.transform_raw_features(\n", + " raw_features)\n", + "\n", + " return tf.estimator.export.ServingInputReceiver(\n", + " transformed_features, serving_input_receiver.receiver_tensors)\n", + "\n", + " return serving_input_fn" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Vc9Edp8A7dsI" + }, + "source": [ + "### Wrap our input data in FeatureColumns\n", + "Our model will expect our data in TensorFlow FeatureColumns." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "6qOFOvBk7oJX", + "colab": {} + }, + "source": [ + "def get_feature_columns(tf_transform_output):\n", + " \"\"\"Returns the FeatureColumns for the model.\n", + "\n", + " Args:\n", + " tf_transform_output: A `TFTransformOutput` object.\n", + "\n", + " Returns:\n", + " A list of FeatureColumns.\n", + " \"\"\"\n", + " # Wrap scalars as real valued columns.\n", + " real_valued_columns = [tf.feature_column.numeric_column(key, shape=())\n", + " for key in NUMERIC_FEATURE_KEYS]\n", + "\n", + " # Wrap categorical columns.\n", + " one_hot_columns = [\n", + " tf.feature_column.categorical_column_with_vocabulary_file(\n", + " key=key,\n", + " vocabulary_file=tf_transform_output.vocabulary_file_by_name(\n", + " vocab_filename=key))\n", + " for key in CATEGORICAL_FEATURE_KEYS]\n", + "\n", + " return real_valued_columns + one_hot_columns" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "LyNTX7CO8AAz" + }, + "source": [ + "## Train, Evaluate, and Export our model" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "8iGQ0jeq8IWr", + "colab": {} + }, + "source": [ + "def train_and_evaluate(working_dir, num_train_instances=NUM_TRAIN_INSTANCES,\n", + " num_test_instances=NUM_TEST_INSTANCES):\n", + " \"\"\"Train the model on training data and evaluate on test data.\n", + "\n", + " Args:\n", + " working_dir: Directory to read transformed data and metadata from and to\n", + " write exported model to.\n", + " num_train_instances: Number of instances in train set\n", + " num_test_instances: Number of instances in test set\n", + "\n", + " Returns:\n", + " The results from the estimator's 'evaluate' method\n", + " \"\"\"\n", + " tf_transform_output = tft.TFTransformOutput(working_dir)\n", + " run_config = tf.estimator.RunConfig()\n", + "\n", + " estimator = tf.estimator.LinearClassifier(\n", + " 
feature_columns=get_feature_columns(tf_transform_output),\n", + " config=run_config)\n", + "\n", + " # Fit the model using the default optimizer.\n", + " train_input_fn = _make_training_input_fn(\n", + " tf_transform_output,\n", + " os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE + '*'),\n", + " batch_size=TRAIN_BATCH_SIZE)\n", + " estimator.train(\n", + " input_fn=train_input_fn,\n", + " max_steps=TRAIN_NUM_EPOCHS * num_train_instances // TRAIN_BATCH_SIZE)\n", + "\n", + " # Evaluate model on test dataset.\n", + " eval_input_fn = _make_training_input_fn(\n", + " tf_transform_output,\n", + " os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE + '*'),\n", + " batch_size=1)\n", + "\n", + " # Export the model.\n", + " serving_input_fn = _make_serving_input_fn(tf_transform_output)\n", + " exported_model_dir = os.path.join(working_dir, EXPORTED_MODEL_DIR)\n", + " estimator.export_saved_model(exported_model_dir, serving_input_fn)\n", + "\n", + " return estimator.evaluate(input_fn=eval_input_fn, steps=num_test_instances)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "JrFJ9Zax8QAh" + }, + "source": [ + "## Put it all together\n", + "\n", + "We've created everything we need to preprocess our census data, train a model, and prepare it for serving. So far we've just been getting things ready. It's time to start running!\n", + "\n", + "> ### Note\n", + "> Scroll through the output from this cell to see the whole process. The results will be at the bottom." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "xe7JdSin86Ez", + "colab": {} + }, + "source": [ + "import time\n", + "\n", + "start = time.time()\n", + "\n", + "transform_data(train, test, temp)\n", + "print('Transform took {:.2f} seconds'.format(time.time() - start))\n", + "results = train_and_evaluate(temp)\n", + "print('Transform and training took {:.2f} seconds'.format(time.time() - start))\n", + "pprint.pprint(results)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ICqetCnSjwp1" + }, + "source": [ + "## What we did\n", + "In this example we used `tf.Transform` to preprocess a dataset of census data, and train a model with the cleaned and transformed data. We also created an input function that we could use when we deploy our trained model in a production environment to perform inference. By using the same code for both training and inference we avoid any issues with data skew. Along the way we learned about creating an Apache Beam transform to perform the transformation that we needed for cleaning the data, and wrapped our data in TensorFlow `FeatureColumns`. This is just a small piece of what TensorFlow Transform can do! We encourage you to dive into `tf.Transform` and discover what it can do for you." + ] + } + ] +} \ No newline at end of file