Cloudera Enterprise 5.15.x

File Formats and Compression

CDH supports all standard Hadoop file formats. For information about the file formats, see the File-Based Data Structures section of the Hadoop I/O chapter in Hadoop: The Definitive Guide.

The file format has a significant impact on performance. Use Avro if your use case typically scans or retrieves all of the fields in a row in each query. Parquet is a better choice if your dataset has many columns, and your use case typically involves working with a subset of those columns instead of entire records. For more information, see this Parquet versus Avro benchmark study.
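The difference between the two access patterns can be sketched with a toy model. The following Python example is an illustration only, not how Avro or Parquet are implemented: it stores the same records once as whole rows and once as per-column lists, then counts how many field values must be decoded to sum a single column. The function names (to_columns, sum_from_rows, sum_from_columns) are hypothetical and exist only for this sketch.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, ...]


def to_columns(rows: Sequence[Row], names: Sequence[str]) -> Dict[str, List[int]]:
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def sum_from_rows(rows: Sequence[Row], names: Sequence[str], wanted: str) -> Tuple[int, int]:
    """Sum one column from row-oriented data.

    Returns (total, values_decoded). In this toy model the whole record
    is decoded to reach a single field, as in a row-oriented layout.
    """
    idx = list(names).index(wanted)
    total = 0
    decoded = 0
    for row in rows:
        decoded += len(row)
        total += row[idx]
    return total, decoded


def sum_from_columns(columns: Dict[str, List[int]], wanted: str) -> Tuple[int, int]:
    """Sum one column from column-oriented data.

    Returns (total, values_decoded). Only the requested column is read.
    """
    values = columns[wanted]
    return sum(values), len(values)


if __name__ == "__main__":
    # Assumed sample data: 1000 records with 10 integer columns each.
    names = [f"c{i}" for i in range(10)]
    rows = [tuple(r * 10 + c for c in range(10)) for r in range(1000)]
    columns = to_columns(rows, names)

    print("row layout:   ", sum_from_rows(rows, names, "c3"))
    print("column layout:", sum_from_columns(columns, "c3"))
```

Both layouts return the same total, but in this model the row layout decodes every field of every record while the column layout decodes only the requested column. Real formats add block structure, encodings, and metadata, so actual savings depend on the data and the engine.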

Important:

The serialization.null.format configuration property is set in Hive and Impala as a SerDe property or a table property. It specifies how NULL values are serialized to, and deserialized from, a storage format.

This configuration option is suitable for text file formats only. If used with binary storage formats such as RCFile or Parquet, the option causes compatibility, complexity and efficiency issues.
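To see why a NULL marker is a text-format concern, consider the following minimal Python sketch of delimited-text serialization. It is not Hive or Impala code. It assumes Hive's default text settings (the marker \N for NULL and Ctrl-A as the field delimiter), and the helper names serialize_row and deserialize_row are hypothetical.

```python
from typing import List, Optional, Sequence

NULL_MARKER = r"\N"   # Hive's default text representation of NULL
FIELD_DELIM = "\x01"  # Hive's default field delimiter (Ctrl-A)


def serialize_row(values: Sequence[object],
                  null_marker: str = NULL_MARKER,
                  delim: str = FIELD_DELIM) -> str:
    """Write one row as delimited text, replacing None with the NULL marker."""
    return delim.join(null_marker if v is None else str(v) for v in values)


def deserialize_row(line: str,
                    null_marker: str = NULL_MARKER,
                    delim: str = FIELD_DELIM) -> List[Optional[str]]:
    """Read one delimited text row, mapping the NULL marker back to None."""
    return [None if field == null_marker else field for field in line.split(delim)]


if __name__ == "__main__":
    # With the default marker, NULL and the empty string stay distinct.
    line = serialize_row(["a", "", None])
    print(deserialize_row(line))

    # With an empty-string marker, empty strings are read back as NULL.
    line = serialize_row(["a", "", None], null_marker="")
    print(deserialize_row(line, null_marker=""))
```

In text files the marker is the only way to tell NULL apart from an ordinary string, so the chosen value changes how data is interpreted. Binary formats record NULL values in their own encoding, which is why a text marker does not suit them.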

All file formats include support for compression, which affects the size of data on the disk and, consequently, the amount of I/O and CPU resources required to serialize and deserialize data.
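The trade-off between data size and CPU cost can be observed with a small experiment. The following Python sketch compresses the same sample data with three codecs from the Python standard library (gzip, bz2, and lzma) and reports the compressed size and the time spent compressing. It is an illustration under assumed sample data, not a benchmark of Hadoop codecs; codecs such as Snappy and LZO are not in the standard library and are not covered. The names make_sample and compare_codecs are hypothetical.

```python
import bz2
import gzip
import lzma
import time
from typing import Dict


def make_sample(rows: int = 5000) -> bytes:
    """Build repetitive CSV-like text, which compresses well."""
    text = "\n".join(f"{i},user{i % 100},click" for i in range(rows))
    return text.encode("utf-8")


def compare_codecs(data: bytes) -> Dict[str, Dict[str, float]]:
    """Compress data with each codec and record size, ratio, and time."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results: Dict[str, Dict[str, float]] = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Verify the round trip so the comparison is between lossless results.
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "compressed_bytes": len(packed),
            "ratio": len(data) / len(packed),
            "compress_seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    sample = make_sample()
    print(f"original: {len(sample)} bytes")
    for codec, stats in compare_codecs(sample).items():
        print(f"{codec}: {stats['compressed_bytes']} bytes, "
              f"ratio {stats['ratio']:.1f}, "
              f"{stats['compress_seconds'] * 1000:.1f} ms")
```

Timings vary by machine and data, so treat the output as a relative comparison only. In general, a codec that produces smaller output reduces I/O but tends to cost more CPU time.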

Page generated May 18, 2018.