Apache Avro (Row-Based Serialization)
Avro is a row-based data serialization format whose container files embed the writer's JSON schema, making the data self-describing. It excels at schema evolution: readers and writers can use different but compatible schemas. Avro is a de facto standard for Kafka message serialization and Hadoop data pipelines.
MIME Type
application/avro
Type
Binary
Compression
Lossless
Advantages
- + Schema evolution — add/remove fields without breaking readers
- + Compact binary encoding with efficient compression
- + Self-describing — schema embedded in the file
- + Standard in Kafka and the Hadoop ecosystem
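The schema-evolution advantage comes from reader-side schema resolution. The sketch below is a toy model of that rule for a flat record (the helper name and field-dict shape are illustrative, not the Avro library's API): fields the writer lacked are filled from the reader schema's defaults, and writer-only fields are dropped.

```python
def resolve_record(writer_record, reader_fields):
    """Toy model of Avro schema resolution for a flat record.

    reader_fields: list of dicts like {"name": ..., "default": ...};
    a "default" is required for any field the writer may not have written.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]   # present in both schemas
        elif "default" in field:
            resolved[name] = field["default"]      # new reader field: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # fields only the writer knew about are simply ignored

# An old writer produced {"id", "name", "legacy_flag"}; a newer reader
# also expects "email" (with a default) and no longer reads "legacy_flag".
old_record = {"id": 7, "name": "ada", "legacy_flag": True}
reader_fields = [
    {"name": "id"},
    {"name": "name"},
    {"name": "email", "default": ""},
]
print(resolve_record(old_record, reader_fields))
# {'id': 7, 'name': 'ada', 'email': ''}
```

Real Avro resolution also handles type promotions and union branches, but the default-filling shown here is why adding a field with a default never breaks old data.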
Disadvantages
- − Row-based — less efficient than Parquet for analytical queries
- − Not human-readable in binary form
- − JSON schema specification has a learning curve
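For a sense of what that JSON schema specification looks like, here is a minimal record schema (the `User` record and its field names are illustrative): a named record with a required `long`, a required `string`, and an optional field expressed as a union with `null`.

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Note that a default for a union field must match the union's first branch, which is why optional fields conventionally list `"null"` first.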
When to Use .AVRO
Use Avro for Kafka message schemas, Hadoop/Spark data pipelines, and any system where schema evolution and compact row storage are priorities.
Technical Details
Avro object container files begin with a header holding the writer's JSON schema and codec metadata, followed by binary-encoded data blocks that may be compressed (deflate and Snappy are the most common codecs). At read time the reader's schema is resolved against the writer's, so fields can be added or removed when defaults are provided, and renamed via aliases, without breaking consumers.
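The compactness of the binary encoding is easiest to see for integers: per the Avro specification, `int` and `long` values are written with zigzag coding followed by a little-endian base-128 varint, so small magnitudes (positive or negative) take a single byte. A stdlib-only sketch:

```python
def encode_long(n: int) -> bytes:
    """Avro long encoding: zigzag, then little-endian base-128 varint."""
    z = (n << 1) ^ (n >> 63)       # zigzag: small |n| -> small unsigned value
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(buf: bytes) -> int:
    z = shift = 0
    for b in buf:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:           # continuation bit clear: last byte
            break
    return (z >> 1) ^ -(z & 1)     # undo zigzag

print(encode_long(1).hex())    # 02
print(encode_long(-1).hex())   # 01
print(encode_long(64).hex())   # 8001  (two bytes, not eight)
```

Strings use the same trick: a varint byte-length prefix followed by raw UTF-8, with no per-field tags, which is a large part of why Avro rows are smaller than equivalent JSON.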
History
Doug Cutting created Avro in 2009 as part of the Hadoop ecosystem. Unlike Thrift and Protocol Buffers, Avro was designed for dynamic schema resolution without code generation.