ORC Snappy compression

9/13/2023

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file, a postscript holds compression parameters and the size of the compressed footer. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column’s data type. It also contains the column-level aggregates count, min, max, and sum.

[Diagram: ORC file structure]

As shown in the diagram, each stripe in an ORC file holds index data, row data, and a stripe footer. The stripe footer contains a directory of stream locations. Index data includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.

Having relatively frequent row index entries enables row-skipping within a stripe for rapid reads, despite large stripe sizes. By default, an index entry is written every 10,000 rows, so rows can be skipped in groups of that size. With the ability to skip large sets of rows based on filter predicates, you can sort a table on its secondary keys to achieve a big reduction in execution time. For example, if the primary partition is transaction date, the table can be sorted on state, zip code, and last name. Then looking for records in one state will skip the records of all other states.

The serialization of column data in an ORC file depends on whether the data type is integer or string.

Integer columns are serialized in two streams: a present bit stream, which records whether each value is non-null, and a data stream, which holds the integers themselves. Integer data is serialized in a way that takes advantage of the common distribution of numbers. Integers are encoded using a variable-width encoding that has fewer bytes for small integers. Repeated values are run-length encoded. Values that differ by a constant in the range -128 to 127 are also run-length encoded.

The variable-width encoding is based on Google’s protocol buffers and uses the high bit to represent whether this byte is not the last and the lower 7 bits to encode data. To encode negative numbers, a zigzag encoding is used where 0, -1, 1, -2, and 2 map into 0, 1, 2, 3, and 4 respectively. In run-length encoding, the first byte specifies the run length and whether the values are literals or duplicates. For duplicates, the second byte (-128 to +127) is added between each repetition. Run-length encoding uses protobuf-style variable-length integers.

Serialization of string columns uses a dictionary to form unique column values. The dictionary is sorted to speed up predicate filtering and improve compression ratios.
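
To make the integer encoding described above more concrete, here is a minimal Python sketch of zigzag mapping, protobuf-style variable-length integers, and a decoder for the run-length layout (control byte, then either a delta byte plus a base value, or a list of literals). The function names are hypothetical and are not part of any ORC library. The decoder assumes the first-generation ORC run-length layout, in which a non-negative control byte means a run of control + 3 values and a negative one means that many literals follow; treat it as an illustration, not a reference implementation.

```python
from typing import List, Tuple


def zigzag_encode(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(u: int) -> int:
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)


def write_varint(u: int) -> bytes:
    """Base-128 varint: low 7 bits carry data, a set high bit means more bytes follow."""
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)  # not the last byte
        else:
            out.append(low7)         # last byte, high bit clear
            return bytes(out)


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def decode_int_rle(buf: bytes, signed: bool = True) -> List[int]:
    """Decode a run-length encoded integer stream (assumed layout, see lead-in)."""
    values: List[int] = []
    pos = 0
    while pos < len(buf):
        control = int.from_bytes(buf[pos:pos + 1], "big", signed=True)
        pos += 1
        if control >= 0:
            # Run: control + 3 values, each one `delta` apart from the previous.
            delta = int.from_bytes(buf[pos:pos + 1], "big", signed=True)
            pos += 1
            base, pos = read_varint(buf, pos)
            if signed:
                base = zigzag_decode(base)
            values.extend(base + i * delta for i in range(control + 3))
        else:
            # Literals: -control individually encoded values follow.
            for _ in range(-control):
                v, pos = read_varint(buf, pos)
                values.append(zigzag_decode(v) if signed else v)
    return values


print(write_varint(300).hex())                                    # ac02
print(decode_int_rle(bytes([0x00, 0x01, 0x01])))                  # [-1, 0, 1]
print(decode_int_rle(bytes([0xFB, 2, 3, 6, 7, 11]), signed=False))  # [2, 3, 6, 7, 11]
```

In this sketch, a hundred values counting down from 100 to 1 fit in three bytes (control 0x61, delta 0xFF, base 0x64), which shows why constant-step sequences compress so well.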
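
The row-skipping idea (min and max values kept per index entry, with an entry every 10,000 rows) can also be sketched in a few lines. This is a toy model under stated assumptions, not how an ORC reader is actually implemented: `RowIndexEntry`, `build_row_index`, and `groups_to_read` are hypothetical names, and the "column" is just an in-memory list of state codes that has already been sorted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowIndexEntry:
    first_row: int   # position of the first row covered by this entry
    row_count: int   # number of rows covered
    min_value: str   # smallest column value in the group
    max_value: str   # largest column value in the group


def build_row_index(values: List[str], rows_per_entry: int = 10_000) -> List[RowIndexEntry]:
    """Record min/max for every group of rows_per_entry rows."""
    entries = []
    for start in range(0, len(values), rows_per_entry):
        chunk = values[start:start + rows_per_entry]
        entries.append(RowIndexEntry(start, len(chunk), min(chunk), max(chunk)))
    return entries


def groups_to_read(entries: List[RowIndexEntry], wanted: str) -> List[RowIndexEntry]:
    """Keep only groups whose [min, max] range could contain the wanted value."""
    return [e for e in entries if e.min_value <= wanted <= e.max_value]


# A column sorted on state: 25,000 rows each for CA, NY, and TX.
state_column = ["CA"] * 25_000 + ["NY"] * 25_000 + ["TX"] * 25_000
index = build_row_index(state_column)

hits = groups_to_read(index, "TX")
print(len(index), "groups in total,", len(hits), "must be read")  # 8 groups in total, 3 must be read
print([e.first_row for e in hits])                                # [50000, 60000, 70000]
```

Because the data is sorted on state, each state occupies a contiguous range of groups and the min/max check rules out the rest. On unsorted data, most groups would contain every state and almost nothing could be skipped, which is the point of sorting on secondary keys.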
