Parquet format support

info

Support for the Parquet format is currently experimental. Contact us at the SquidDevs Telegram channel for support.

Table Implementation

Apache Parquet is an advanced format for storing tabular data in files. It divides table columns into column chunks. Each column chunk is stored contiguously, allowing efficient partial reads of column subsets. Column chunks can also be compressed with column-specific compression algorithms, further improving performance. Retrieval relies on metadata appended to the end of a Parquet file. The Apache Parquet metadata standard is extremely powerful, enabling all sorts of extensions. Among other things, the metadata contains the schema of the data, making the format self-describing.

The @subsquid/file-store-parquet package provides a Table implementation for writing to Parquet files. Use it by supplying one or more of its instances via the tables field of the Database constructor argument. The Table constructor accepts the following arguments (a minimal usage sketch follows the list):

  • fileName: string: the name of the output file in every dataset partition folder.
  • schema: {[column: string]: ColumnData}: a mapping from Parquet column names to ColumnData objects. A mapping of the same keys to data values is the row type used by the table writer.
  • options?: TableOptions: see Table Options.
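
For illustration, here is a minimal sketch of a table definition; the file name and column set are made up, and the optional TableOptions argument is omitted (see the full example at the end of this page):

import {Column, Table, Types} from '@subsquid/file-store-parquet'

// A hypothetical table written to blocks.parquet in every dataset partition folder
const blocksTable = new Table(
  'blocks.parquet',
  {
    height: Column(Types.Uint32()),
    hash: Column(Types.String())
  }
)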

Columns

ColumnData objects define storage options for each table column. They are made with the Column factory function that accepts a column data type and an optional options: ColumnOptions object.

Column data types are obtained by calling the functions of the Types submodule listed below. They determine the Parquet type that will be used to store the data and the type that the table writer will expect at the corresponding field of data row objects. A short schema sketch follows the table.

Column type | Logical type | Primitive type | Valid data row object field contents
--- | --- | --- | ---
Types.String() | variable length string | BYTE_ARRAY | string of any length
Types.Binary(length?) | variable or fixed length byte array | BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY | Uint8Array of length equal to length if it is set, or of any length otherwise
Types.Int8() | 8-bit signed integer | INT32 | number from -128 to 127
Types.Int16() | 16-bit signed integer | INT32 | number from -32768 to 32767
Types.Int32() | 32-bit signed integer | INT32 | number from -2147483648 to 2147483647
Types.Int64() | 64-bit signed integer | INT64 | bigint or number from -9223372036854775808 to 9223372036854775807
Types.Uint8() | 8-bit unsigned integer | INT32 | number from 0 to 255
Types.Uint16() | 16-bit unsigned integer | INT32 | number from 0 to 65535
Types.Uint32() | 32-bit unsigned integer | INT32 | number from 0 to 4294967295
Types.Uint64() | 64-bit unsigned integer | INT64 | bigint or number from 0 to 18446744073709551615
Types.Float() | 32-bit floating point number | FLOAT | non-NaN number
Types.Double() | 64-bit floating point number | DOUBLE | non-NaN number
Types.Boolean() | boolean value | BOOLEAN | boolean
Types.Timestamp() | UNIX timestamp in milliseconds | INT64 | Date
Types.Decimal(precision, scale=0) | decimal with precision digits and scale digits to the right of the decimal point | INT32 or INT64 or FIXED_LEN_BYTE_ARRAY | number or bigint or BigDecimal
Types.List(itemType, {nullable=false}) | a list filled with optionally nullable items of itemType column type | - | Array of items satisfying itemType
Types.JSON<T extends {[k: string]: any}>() | JSON object of type T | BYTE_ARRAY | Object of type T
Types.BSON<T extends {[k: string]: any}>() | BSON object of type T | BYTE_ARRAY | Object of type T
tip

The widest decimals that PyArrow can read are Types.Decimal(76).
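
As an illustration, here is a sketch of a schema that combines several of these types; the column names and the TxInfo interface are invented for the example:

import {Column, Types} from '@subsquid/file-store-parquet'

// Hypothetical payload type for a JSON column
interface TxInfo {
  hash: string
  status: number
}

const schema = {
  timestamp: Column(Types.Timestamp()),        // expects a Date, stored as INT64
  amount: Column(Types.Decimal(38, 18)),       // 38 total digits, 18 after the decimal point
  topics: Column(Types.List(Types.String())),  // expects an array of strings
  tx: Column(Types.JSON<TxInfo>())             // expects an object of type TxInfo
}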

The following column options are available:

ColumnOptions {
  nullable?: boolean
  compression?: Compression
  encoding?: Encoding
}

See the Encoding and Compression section for details.
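
For example, an individual column that accepts null values and overrides the file-wide compression could be declared like this (a sketch; the memo column is invented):

import {Column, Types} from '@subsquid/file-store-parquet'

// A nullable string column compressed with GZIP regardless of the
// file-wide compression setting; 'PLAIN' is currently the only encoding
const memo = Column(
  Types.String(),
  {
    nullable: true,
    compression: 'GZIP',
    encoding: 'PLAIN'
  }
)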

Table Options

As its optional final argument, the constructor of Table accepts an object that defines table options:

TableOptions {
  compression?: Compression
  rowGroupSize?: number
  pageSize?: number
}

Here,

  • compression determines the file-wide compression algorithm. Per-column settings override this. See Encoding and Compression for the list of available algorithms. Default: Compression.UNCOMPRESSED.
  • rowGroupSize determines the approximate uncompressed size of the row group in bytes. Default: 32 * 1024 * 1024.
  • pageSize determines the approximate uncompressed page size in bytes. Default: 8 * 1024.

When rowGroupSize is less than pageSize times the number of columns, the rowGroupSize setting will be ignored. In this case each row group will contain exactly one roughly pageSize-sized page for each column.
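
To make the arithmetic concrete, here is a sketch with made-up numbers for a three-column table:

// With pageSize: 1000 and three columns, one page per column takes ~3000 bytes,
// so rowGroupSize: 30000 fits roughly ten ~1000-byte pages per column.
// Setting rowGroupSize below 3000 would have it ignored: each row group would
// then hold exactly one roughly 1000-byte page per column.
const tableOptions = {
  rowGroupSize: 30000,
  pageSize: 1000
}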

Encoding and Compression

Encodings are set on a per-column basis. At the moment, 'PLAIN' is the default and the only supported value.

Compression can be set on a per-file or per-column basis. Available values are

  • 'UNCOMPRESSED' (default)
  • 'GZIP'
  • 'LZO'
  • 'BROTLI'
  • 'LZ4'

Example

This example saves ERC20 Transfer events captured by the processor to a Parquet file. All columns except from are compressed with GZIP. Row groups are set to be roughly 30000 bytes each, so each row group contains roughly ten ~1000-byte pages per column. Full squid code is available in this repo.

import {Database, LocalDest} from '@subsquid/file-store'
import {
  Column,
  Table,
  Types
} from '@subsquid/file-store-parquet'

...

const dbOptions = {
  tables: {
    TransfersTable: new Table(
      'transfers.parquet',
      {
        from: Column(
          Types.String(),
          {
            compression: 'UNCOMPRESSED'
          }
        ),
        to: Column(Types.String()),
        value: Column(Types.Uint64())
      },
      {
        compression: 'GZIP',
        rowGroupSize: 30000,
        pageSize: 1000
      }
    )
  },
  dest: new LocalDest('./data'),
  chunkSizeMb: 10
}

processor.run(new Database(dbOptions), async (ctx) => {
  ...
  let from: string = ...
  let to: string = ...
  let value: bigint = ...
  ctx.store.TransfersTable.write({ from, to, value })
  ...
})