Evaluation of PNGX ideas

Kirinn Bunnylin

This document describes a variety of trivial additions that could be made to the PNG image format, improving its compression performance without compromising its simplicity, while seeking to maintain backward compatibility.

ZStandard compression

Zstd is well known to compress better and run faster than DEFLATE. It has become widely adopted and appears to have a tolerable license grant, making it acceptable for use in free software.

Adding zstd as compression method 1 should be an easy win. This doesn't affect image pre- or post-processing, or the filtering step; the only change needed is to dispatch the bytestream to a different compression library when the compression method value is 1.
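As a sketch, the dispatch could look like this in Python. The function names are hypothetical; zlib stands in for PNG's DEFLATE path, and the zstd branch assumes the third-party zstandard package, imported lazily so the DEFLATE path works without it.

```python
import zlib

def compress_idat(data: bytes, method: int) -> bytes:
    # Hypothetical dispatch on the compression-method value:
    # 0 = DEFLATE (baseline PNG), 1 = zstd (the proposed PNGX addition).
    if method == 0:
        return zlib.compress(data)
    if method == 1:
        import zstandard  # third-party; pip install zstandard
        return zstandard.ZstdCompressor().compress(data)
    raise ValueError(f"unknown compression method {method}")

def decompress_idat(data: bytes, method: int) -> bytes:
    if method == 0:
        return zlib.decompress(data)
    if method == 1:
        import zstandard
        return zstandard.ZstdDecompressor().decompress(data)
    raise ValueError(f"unknown compression method {method}")
```

Everything before and after this call (filtering, chunking) stays exactly as in baseline PNG.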

Other compression algorithms could be considered, but in order to retain simplicity and maintainability, only one new algorithm should actually be defined for PNGX. Zstd seems like the most obvious step after DEFLATE.

Dictionary support

Zstd's dictionary support could greatly improve storage of a collection of images, such as a game's assets. However, since this entails managing an extra shared dictionary file, it would complicate the image format too much. A game developer may still want to add it in a custom implementation.

New filters 5..15: Difference from more locations

The original filter set only has difference predictors for the byte to the left and the byte directly above, an average of the two, and the Paeth predictor. Basic filters like these are cheap to add, so we may as well use the full predictor set from WebP's lossless mode. The WebP license has a patent grant intended to allow free use and modification by anyone not patent-litigating against them, so I would cautiously consider these acceptable for free software.

Two additional filters for images with ordered dithering could be the byte four positions to the left, and the byte four rows above. If the four-rows-above source falls outside the image, the encoder should not choose this filter; if a decoder encounters it there anyway, it acts as a filter 0 direct copy. The four-to-the-left filter should copy from the byte directly above for the leftmost four bytes, which becomes a direct copy when used for the first four pixels of the image.
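A minimal Python sketch of these two dithering-oriented predictors, with the edge fallbacks as proposed above (the function names are my own):

```python
def predict_left4(prev_row, cur_row, x):
    # "Four to the left" predictor: use the byte 4 positions to the left.
    # For the leftmost 4 bytes, fall back to the byte directly above;
    # on the first row (prev_row is None) that degrades to predicting 0.
    if x >= 4:
        return cur_row[x - 4]
    if prev_row is not None:
        return prev_row[x]
    return 0

def predict_up4(rows_above, x):
    # "Four above" predictor: use the byte 4 rows up in the same column.
    # Near the top edge it acts as filter 0, i.e. predicts 0.
    if len(rows_above) >= 4:
        return rows_above[-4][x]
    return 0
```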

New filters:

Averages and clamps are mainly useful for images whose colors follow a natural gradient order, i.e. truecolor or grayscale. For indexed-color images, it may be best to simply skip the average filters when encoding.

New filter 16: Median adaptive predictor (MAP or MED)

A median filter is a powerful addition alongside the existing Paeth filter. It was described by Martucci in 1990, so it is definitely free to use by now, and it is computationally comparable to Paeth. It is the same as the JPEG-LS nonlinear predictor.

The neighboring bytes are notated here with compass directions: w is the byte to the left, n the byte above, and nw the byte above-left.

if nw >= max(w, n) then P := min(w, n)
else if nw <= min(w, n) then P := max(w, n)
else P := w + n - nw
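The pseudocode above translates directly to Python; a sketch:

```python
def med_predict(w: int, n: int, nw: int) -> int:
    # Median adaptive predictor (Martucci 1990; the JPEG-LS MED predictor).
    # w = byte to the left, n = byte above, nw = byte above-left.
    if nw >= max(w, n):
        return min(w, n)
    if nw <= min(w, n):
        return max(w, n)
    return w + n - nw
```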

This could be useful in all color modes, including indexed.

New filter 17: Gradient adjusted predictor (GAP)

This predictor is used in the CALIC codec (Wu and Memon, 1997), so it is surely free to use by now.

It has 7 source bytes instead of the 3 used by MAP. The neighboring bytes are defined as steps in compass directions: w, nw, n, ne, ww, nn, and nne. If any such byte is outside image bounds, substitute 0.

Let dh := abs(w - ww) + abs(n - nw) + abs(n - ne)
Let dv := abs(w - nw) + abs(n - nn) + abs(ne - nne)
Let D := dv - dh

if D > 80 then P := w
else if D < -80 then P := n
else begin
  P := (w + n) / 2 + (ne - nw) / 4
  if D > 32 then P := (P + w) / 2
  else if D > 8 then P := (3 * P + w) / 4
  else if D < -32 then P := (P + n) / 2
  else if D < -8 then P := (3 * P + n) / 4
end
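The same logic as a Python sketch. One caveat: Python's // floors toward negative infinity, while the pseudocode's division is assumed to behave the same way; with truncating division, negative intermediate values would round slightly differently.

```python
def gap_predict(w, n, nw, ne, ww, nn, nne):
    # Gradient adjusted predictor from CALIC (Wu & Memon 1997), following
    # the thresholds in the pseudocode above; out-of-bounds bytes are 0.
    dh = abs(w - ww) + abs(n - nw) + abs(n - ne)  # horizontal gradient
    dv = abs(w - nw) + abs(n - nn) + abs(ne - nne)  # vertical gradient
    d = dv - dh
    if d > 80:      # sharp horizontal edge: predict from the left
        return w
    if d < -80:     # sharp vertical edge: predict from above
        return n
    p = (w + n) // 2 + (ne - nw) // 4
    if d > 32:
        p = (p + w) // 2
    elif d > 8:
        p = (3 * p + w) // 4
    elif d < -32:
        p = (p + n) // 2
    elif d < -8:
        p = (3 * p + n) // 4
    return p
```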

Since this makes guesses based on gradients, it works best on truecolor and grayscale images, where colors are in natural gradient order. Indexed-color images can have such a palette order too, but usually don't. You could analyse how gradient-like the palette is and decide based on that whether to test GAP during encoding, but it's easier to never use GAP when encoding indexed-color images.

New filter 18: Gradient edge detection (GED)

GED, published by Avramovic and Reljin in 2010, is positioned halfway between MAP and GAP in complexity. It's not clear to me whether it is patented, so it can't be used without further investigation.

This uses 5 source bytes, again notated with compass directions: w, ww, n, nn, and nw. T is a configurable threshold value, with a suggested default of 8.

Let gh := abs(ww - w) + abs(nw - n)
Let gv := abs(nn - n) + abs(nw - w)
Let G := gv - gh

if G > T then P := w
else if G < -T then P := n
else P := 3 * (w + n) / 8 + (nw + ww + nn) / 12
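A Python sketch of the predictor, subject to the patent caveat above; as with GAP, the divisions are assumed to behave like Python's flooring //.

```python
def ged_predict(w, n, nw, ww, nn, threshold=8):
    # Gradient edge detection predictor (Avramovic & Reljin 2010),
    # with the suggested default threshold T = 8.
    gh = abs(ww - w) + abs(nw - n)  # horizontal gradient
    gv = abs(nn - n) + abs(nw - w)  # vertical gradient
    g = gv - gh
    if g > threshold:       # strong vertical change: predict from the left
        return w
    if g < -threshold:      # strong horizontal change: predict from above
        return n
    return 3 * (w + n) // 8 + (nw + ww + nn) // 12
```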

New filter: 0x80 Subtract green

Another simple transformation from WebP, this subtracts green from the red and blue bytes when encoding, leaving alpha and green itself unchanged. When decoding, green is added back. Since green correlates somewhat with luma, and luma somewhat correlates with all colors, this operation partially eliminates brightness gradients from the image for improved compressibility.

This only makes sense in truecolor mode, and may combine well with another filter applied afterward during encoding. (When decoding, green is added back after defiltering.) I propose adding a Subtract Green bit as the MSB of the row predictor byte.
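A sketch of the transform pair in Python, assuming 8-bit RGBA byte order (an assumption for illustration; the channel layout would follow the PNG color type). Subtraction wraps modulo 256, as in WebP.

```python
def subtract_green(rgba: bytes) -> bytes:
    # Encoding side: store red and blue as (channel - green) mod 256;
    # green and alpha are left unchanged.
    out = bytearray(rgba)
    for i in range(0, len(out), 4):
        g = out[i + 1]
        out[i] = (out[i] - g) & 0xFF        # red
        out[i + 2] = (out[i + 2] - g) & 0xFF  # blue
    return bytes(out)

def add_green(rgba: bytes) -> bytes:
    # Decoding side: the exact inverse, applied after defiltering.
    out = bytearray(rgba)
    for i in range(0, len(out), 4):
        g = out[i + 1]
        out[i] = (out[i] + g) & 0xFF
        out[i + 2] = (out[i + 2] + g) & 0xFF
    return bytes(out)
```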

New filter: 0x40 Nibble Flip

Indexed-color images, particularly at 4bpp or less, often use ordered dithering, variations of a checkerboard pattern. PNG does a fair job compressing this, but if the pattern can be simplified, it may work even better.

I propose adding a Nibble Flip bit as the 2nd MSB of the row predictor byte. If set, the predicted pixels on the row are nibble-flipped after calculation (likely applied to the whole row in one pass for efficiency, after all pixels on the row have been predicted). This does increase encoding complexity somewhat, but the Nibble Flip only makes sense with indexed-color images at 4bpp or less, so truecolor and 8bpp encoding would be unaffected.

It's only called Nibble Flip because it sounds funny; what it actually does is swap the pixel pairs within each byte. At 4bpp, swap the top and bottom nibbles; at 2bpp, swap the 2-bit halves within each nibble; at 1bpp, swap all adjacent bit pairs.

I assume this has already been invented and possibly patented elsewhere, so caution is advised before using this.
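A Python sketch of the row transform (flip_row is a hypothetical name). Note it is its own inverse: applying it twice restores the original row, so the same code serves encoding and decoding.

```python
def flip_row(row: bytes, bpp: int) -> bytes:
    # Swap the pixel pairs within each byte, per bit depth:
    # 4 bpp: swap the top and bottom nibbles;
    # 2 bpp: swap the 2-bit halves within each nibble;
    # 1 bpp: swap each adjacent bit pair.
    out = bytearray(len(row))
    for i, b in enumerate(row):
        if bpp == 4:
            out[i] = ((b & 0x0F) << 4) | (b >> 4)
        elif bpp == 2:
            out[i] = ((b & 0x33) << 2) | ((b & 0xCC) >> 2)
        elif bpp == 1:
            out[i] = ((b & 0x55) << 1) | ((b & 0xAA) >> 1)
        else:
            raise ValueError("nibble flip only applies at 1, 2, or 4 bpp")
    return bytes(out)
```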

Predictor for the predictor

Instead of storing a predictor byte at the start of each row, the active predictor could be derived by evaluating what would have been the best predictor for the last byte or several bytes. Start with filter 0; then, for each byte after the first pixel, keep the active filter if it predicted the last pixel correctly, and otherwise switch to whichever filter predicted the most pixels correctly over the last window of interest. Basically any window size can be used without performance impact, with likely sizes ranging from 1 up to the distance from the start of the row. If multiple predictors tie for best, use the lowest-numbered one.
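A rough Python sketch of the idea, assuming the per-filter predictions for the row have been precomputed; the fixed-size window and all names here are illustrative, not part of any spec.

```python
def choose_filters(row_bytes, predictions, window=8):
    # Derives the active filter per byte from past performance instead of
    # a stored per-row filter byte. predictions[f][i] is what filter f
    # would have predicted for byte i. Returns the filter used at each
    # position; a decoder replaying the same rule stays in sync.
    nfilters = len(predictions)
    active = 0  # start with filter 0
    chosen = []
    for i, actual in enumerate(row_bytes):
        chosen.append(active)
        if predictions[active][i] != actual:
            # Active filter missed: switch to whichever filter predicted
            # the most bytes correctly over the recent window. Ties go to
            # the lowest-numbered filter via scores.index(max(...)).
            lo = max(0, i - window + 1)
            scores = [sum(predictions[f][j] == row_bytes[j]
                          for j in range(lo, i + 1))
                      for f in range(nfilters)]
            active = scores.index(max(scores))
    return chosen
```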

This would break backward compatibility, requiring non-trivial encoder and decoder changes, so it's out of scope for PNGX. There are also some pathological cases where this would perform worse than the best per-row predictor. But this would remove the need to test different row predictor combinations when aiming for best compression, to some degree actually simplifying the encoder.

I assume this has already been invented and possibly patented elsewhere, so caution is advised before using this.

Sort palette indexes

When a prediction is missed, the compression algorithm has to encode the difference between the prediction and real value. If the palette is randomly distributed, those difference values are likely to be large as well. Sorting the palette may allow minimising difference values.

The simplest approach is to count how many pixels of each indexed color are present in the image, then sort the palette with the most frequent colors at the bottom (the lowest indexes). A possible further improvement would be to sort each color as close as possible to the colors it most commonly appears next to in the image.
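A sketch of the frequency sort in Python for 8bpp indexed data; the function name is my own, and the pixel data must be remapped along with the palette so the image is unchanged.

```python
from collections import Counter

def sort_palette_by_frequency(pixels, palette):
    # Reorders the palette so the most frequent colors get the lowest
    # indexes, and remaps the pixel data to match. pixels is a sequence
    # of 8 bpp palette indexes; palette is a list of color tuples.
    counts = Counter(pixels)
    # Most frequent first; ties keep their original relative order.
    order = sorted(range(len(palette)), key=lambda i: -counts[i])
    remap = {old: new for new, old in enumerate(order)}
    new_palette = [palette[old] for old in order]
    new_pixels = bytes(remap[p] for p in pixels)
    return new_pixels, new_palette
```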

Keep a table of all colors * all colors. For each byte X in the image, where the byte to its left is A and the byte directly above is B, add 1 to table[X][A] and table[X][B]. For the top row, ignore B; for the leftmost byte of each row, ignore A. At the end of the image, use the table to sort the palette.
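The table-building pass might look like this in Python (a sketch; neighbor_table is a hypothetical name):

```python
def neighbor_table(rows, ncolors):
    # Builds the all-colors x all-colors adjacency table described above:
    # for each pixel X, tally its left neighbor A and upper neighbor B.
    # rows is a list of byte rows of 8 bpp palette indexes.
    table = [[0] * ncolors for _ in range(ncolors)]
    for y, row in enumerate(rows):
        for x, pix in enumerate(row):
            if x > 0:
                table[pix][row[x - 1]] += 1      # left neighbor A
            if y > 0:
                table[pix][rows[y - 1][x]] += 1  # upper neighbor B
    return table
```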

Using the table could entail calculating, for each color X, a weighted average of its neighbors' current palette indexes, with each neighbor weighted by its share of all of X's recorded neighbors, then re-sorting the palette by those averages. Repeat this process at least once, and probably a few times, and the palette should settle close to an optimal order.

PNG's bytewise filtering means palette reordering is only likely to be useful for 8bpp indexed images.

Since palette order can be important, it is not safe to rearrange the palette in every image, so this can't be part of PNGX. However, game developers could use this as a custom addition, since they know which images allow palette reordering. Encoding may slow down significantly, since the image must be iterated an extra time and the final sort involves a lot of arithmetic, but decoding needs no changes and suffers no slowdown.

I assume this has already been invented and possibly patented elsewhere, so caution is advised before using this.