Snappy Algorithm

How should one read the Snappy source code? This post walks through Snappy's compression and decompression paths, ending with pseudocode for both, to help you understand how the algorithm works.

Introduction

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. (For more information, see “Performance”, below.)

Snappy has the following properties:

  • Fast: Compression speeds at 250 MB/sec and beyond, with no assembler code. See “Performance” below.
  • Stable: Over the last few years, Snappy has compressed and decompressed petabytes of data in Google’s production environment. The Snappy bitstream format is stable and will not change between versions.
  • Robust: The Snappy decompressor is designed not to crash in the face of corrupted or malicious input.
  • Free and open source software: Snappy is licensed under a BSD-type license. For more information, see the included COPYING file.

Snappy has previously been called “Zippy” in some Google presentations and the like.

Snappy in RocksDB

  • How to link: https://github.com/facebook/rocksdb/blob/main/build_tools/build_detect_platform

    if ! test $ROCKSDB_DISABLE_SNAPPY; then
        # Test whether Snappy library is installed
        # http://code.google.com/p/snappy/
        $CXX $PLATFORM_CXXFLAGS -x c++ - -o /dev/null 2>/dev/null <<EOF
          #include <snappy.h>
          int main() {}
    EOF
        # (If the probe compiles, build_detect_platform goes on to add
        # -DSNAPPY to COMMON_FLAGS and -lsnappy to the link flags.)
    fi
  • Where to use: https://github.com/facebook/rocksdb/blob/main/util/compression.h

    inline bool Snappy_Compress(const CompressionInfo& /*info*/, const char* input,
                                size_t length, ::std::string* output) {
    #ifdef SNAPPY
      output->resize(snappy::MaxCompressedLength(length));
      size_t outlen;
      snappy::RawCompress(input, length, &(*output)[0], &outlen);
      output->resize(outlen);
      return true;
    #else
      (void)input;
      (void)length;
      (void)output;
      return false;
    #endif
    }

    inline CacheAllocationPtr Snappy_Uncompress(
        const char* input, size_t length, size_t* uncompressed_size,
        MemoryAllocator* allocator = nullptr) {
    #ifdef SNAPPY
      size_t uncompressed_length = 0;
      if (!snappy::GetUncompressedLength(input, length, &uncompressed_length)) {
        return nullptr;
      }

      CacheAllocationPtr output = AllocateBlock(uncompressed_length, allocator);

      if (!snappy::RawUncompress(input, length, output.get())) {
        return nullptr;
      }

      *uncompressed_size = uncompressed_length;

      return output;
    #else
      (void)input;
      (void)length;
      (void)uncompressed_size;
      (void)allocator;
      return nullptr;
    #endif
    }

    RocksDB mainly calls two Snappy entry points: RawCompress and RawUncompress.

Snappy

Source: https://github.com/google/snappy/

First we look at the on-disk format, then we analyze Snappy's compression and decompression paths starting from RawCompress and RawUncompress respectively.

Format

format_description.txt documents the encoding:

Snappy compressed format description
Last revised: 2011-10-05


This is not a formal specification, but should suffice to explain most
relevant parts of how the Snappy format works. It is originally based on
text by Zeev Tarantov.

Snappy is a LZ77-type compressor with a fixed, byte-oriented encoding.
There is no entropy encoder backend nor framing layer -- the latter is
assumed to be handled by other parts of the system.

This document only describes the format, not how the Snappy compressor nor
decompressor actually works. The correctness of the decompressor should not
depend on implementation details of the compressor, and vice versa.


1. Preamble

The stream starts with the uncompressed length (up to a maximum of 2^32 - 1),
stored as a little-endian varint. Varints consist of a series of bytes,
where the lower 7 bits are data and the upper bit is set iff there are
more bytes to be read. In other words, an uncompressed length of 64 would
be stored as 0x40, and an uncompressed length of 2097150 (0x1FFFFE)
would be stored as 0xFE 0xFF 0x7F.


2. The compressed stream itself

There are two types of elements in a Snappy stream: Literals and
copies (backreferences). There is no restriction on the order of elements,
except that the stream naturally cannot start with a copy. (Having
two literals in a row is never optimal from a compression point of
view, but nevertheless fully permitted.) Each element starts with a tag byte,
and the lower two bits of this tag byte signal what type of element will
follow:

00: Literal
01: Copy with 1-byte offset
10: Copy with 2-byte offset
11: Copy with 4-byte offset

The interpretation of the upper six bits are element-dependent.


2.1. Literals (00)

Literals are uncompressed data stored directly in the byte stream.
The literal length is stored differently depending on the length
of the literal:

- For literals up to and including 60 bytes in length, the upper
six bits of the tag byte contain (len-1). The literal follows
immediately thereafter in the bytestream.
- For longer literals, the (len-1) value is stored after the tag byte,
little-endian. The upper six bits of the tag byte describe how
many bytes are used for the length; 60, 61, 62 or 63 for
1-4 bytes, respectively. The literal itself follows after the
length.


2.2. Copies

Copies are references back into previous decompressed data, telling
the decompressor to reuse data it has previously decoded.
They encode two values: The _offset_, saying how many bytes back
from the current position to read, and the _length_, how many bytes
to copy. Offsets of zero can be encoded, but are not legal;
similarly, it is possible to encode backreferences that would
go past the end of the block (offset > current decompressed position),
which is also nonsensical and thus not allowed.

As in most LZ77-based compressors, the length can be larger than the offset,
yielding a form of run-length encoding (RLE). For instance,
"xababab" could be encoded as

<literal: "xab"> <copy: offset=2 length=4>

Note that since the current Snappy compressor works in 32 kB
blocks and does not do matching across blocks, it will never produce
a bitstream with offsets larger than about 32768. However, the
decompressor should not rely on this, as it may change in the future.

There are several different kinds of copy elements, depending on
the amount of bytes to be copied (length), and how far back the
data to be copied is (offset).


2.2.1. Copy with 1-byte offset (01)

These elements can encode lengths between [4..11] bytes and offsets
between [0..2047] bytes. (len-4) occupies three bits and is stored
in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the
upper three are stored in the upper three bits ([5..7]) of the tag byte,
and the lower eight are stored in a byte following the tag byte.


2.2.2. Copy with 2-byte offset (10)

These elements can encode lengths between [1..64] and offsets from
[0..65535]. (len-1) occupies six bits and is stored in the upper
six bits ([2..7]) of the tag byte. The offset is stored as a
little-endian 16-bit integer in the two bytes following the tag byte.


2.2.3. Copy with 4-byte offset (11)

These are like the copies with 2-byte offsets (see previous subsection),
except that the offset is stored as a 32-bit integer instead of a
16-bit integer (and thus will occupy four bytes).

Compress

  • RawCompress

    void RawCompress(const char* input, size_t input_length, char* compressed,
                     size_t* compressed_length) {
      ByteArraySource reader(input, input_length);
      UncheckedByteArraySink writer(compressed);
      Compress(&reader, &writer);

      // Compute how many bytes were added
      *compressed_length = (writer.CurrentDestination() - compressed);
    }

    First a reader and a writer are constructed from the arguments, then Compress performs the actual compression, and finally compressed_length is computed.

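    For context, this is how the raw API pair is used end to end (a minimal
    sketch using only public snappy calls; error handling kept simple):

    #include <snappy.h>

    #include <string>
    #include <vector>

    // Round-trips `input` through RawCompress/RawUncompress.
    bool RoundTrip(const std::string& input, std::string* restored) {
      // RawCompress requires a buffer of at least MaxCompressedLength(n) bytes.
      std::vector<char> compressed(snappy::MaxCompressedLength(input.size()));
      size_t compressed_length = 0;
      snappy::RawCompress(input.data(), input.size(), compressed.data(),
                          &compressed_length);

      // The uncompressed length is recovered from the varint preamble.
      size_t uncompressed_length = 0;
      if (!snappy::GetUncompressedLength(compressed.data(), compressed_length,
                                         &uncompressed_length)) {
        return false;
      }
      restored->resize(uncompressed_length);
      return snappy::RawUncompress(compressed.data(), compressed_length,
                                   &(*restored)[0]);
    }
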
    Next, let's look at how the reader and the writer are structured.

  • ByteArraySource

    // A Source is an interface that yields a sequence of bytes
    class Source {
     public:
      Source() { }
      virtual ~Source();

      // Return the number of bytes left to read from the source
      virtual size_t Available() const = 0;

      // Peek at the next flat region of the source. Does not reposition
      // the source. The returned region is empty iff Available()==0.
      //
      // Returns a pointer to the beginning of the region and store its
      // length in *len.
      //
      // The returned region is valid until the next call to Skip() or
      // until this object is destroyed, whichever occurs first.
      //
      // The returned region may be larger than Available() (for example
      // if this ByteSource is a view on a substring of a larger source).
      // The caller is responsible for ensuring that it only reads the
      // Available() bytes.
      virtual const char* Peek(size_t* len) = 0;

      // Skip the next n bytes. Invalidates any buffer returned by
      // a previous call to Peek().
      // REQUIRES: Available() >= n
      virtual void Skip(size_t n) = 0;

     private:
      // No copying
      Source(const Source&);
      void operator=(const Source&);
    };

    // A Source implementation that yields the contents of a flat array
    class ByteArraySource : public Source {
     public:
      ByteArraySource(const char* p, size_t n) : ptr_(p), left_(n) { }
      ~ByteArraySource() override;
      size_t Available() const override;
      const char* Peek(size_t* len) override;
      void Skip(size_t n) override;
     private:
      const char* ptr_;
      size_t left_;
    };
    • Available: returns how many bytes are left to read.
    • Peek: returns a pointer to the next flat region that can be read, along with its length; the returned buffer must remain valid until Skip is called.
    • Skip: tells the Source that the next n bytes are no longer needed, and skips over them.
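    The flat-array implementation of these three methods is essentially just
    pointer arithmetic; a sketch consistent with the interface above (the real
    definitions live in snappy-sinksource.cc):

    size_t ByteArraySource::Available() const { return left_; }

    const char* ByteArraySource::Peek(size_t* len) {
      *len = left_;  // the whole remaining array is one flat region
      return ptr_;
    }

    void ByteArraySource::Skip(size_t n) {
      left_ -= n;  // REQUIRES: Available() >= n
      ptr_ += n;
    }
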
  • UncheckedByteArraySink

    // A Sink is an interface that consumes a sequence of bytes.
    class Sink {
     public:
      Sink() { }
      virtual ~Sink();

      // Append "bytes[0,n-1]" to this.
      virtual void Append(const char* bytes, size_t n) = 0;

      // Returns a writable buffer of the specified length for appending.
      // May return a pointer to the caller-owned scratch buffer which
      // must have at least the indicated length. The returned buffer is
      // only valid until the next operation on this Sink.
      //
      // After writing at most "length" bytes, call Append() with the
      // pointer returned from this function and the number of bytes
      // written. Many Append() implementations will avoid copying
      // bytes if this function returned an internal buffer.
      //
      // If a non-scratch buffer is returned, the caller may only pass a
      // prefix of it to Append(). That is, it is not correct to pass an
      // interior pointer of the returned array to Append().
      //
      // The default implementation always returns the scratch buffer.
      virtual char* GetAppendBuffer(size_t length, char* scratch);

      // For higher performance, Sink implementations can provide custom
      // AppendAndTakeOwnership() and GetAppendBufferVariable() methods.
      // These methods can reduce the number of copies done during
      // compression/decompression.

      // Append "bytes[0,n-1] to the sink. Takes ownership of "bytes"
      // and calls the deleter function as (*deleter)(deleter_arg, bytes, n)
      // to free the buffer. deleter function must be non NULL.
      //
      // The default implementation just calls Append and frees "bytes".
      // Other implementations may avoid a copy while appending the buffer.
      virtual void AppendAndTakeOwnership(
          char* bytes, size_t n, void (*deleter)(void*, const char*, size_t),
          void *deleter_arg);

      // Returns a writable buffer for appending and writes the buffer's capacity to
      // *allocated_size. Guarantees *allocated_size >= min_size.
      // May return a pointer to the caller-owned scratch buffer which must have
      // scratch_size >= min_size.
      //
      // The returned buffer is only valid until the next operation
      // on this ByteSink.
      //
      // After writing at most *allocated_size bytes, call Append() with the
      // pointer returned from this function and the number of bytes written.
      // Many Append() implementations will avoid copying bytes if this function
      // returned an internal buffer.
      //
      // If the sink implementation allocates or reallocates an internal buffer,
      // it should use the desired_size_hint if appropriate. If a caller cannot
      // provide a reasonable guess at the desired capacity, it should set
      // desired_size_hint = 0.
      //
      // If a non-scratch buffer is returned, the caller may only pass
      // a prefix to it to Append(). That is, it is not correct to pass an
      // interior pointer to Append().
      //
      // The default implementation always returns the scratch buffer.
      virtual char* GetAppendBufferVariable(
          size_t min_size, size_t desired_size_hint, char* scratch,
          size_t scratch_size, size_t* allocated_size);

     private:
      // No copying
      Sink(const Sink&);
      void operator=(const Sink&);
    };

    // A Sink implementation that writes to a flat array without any bound checks.
    class UncheckedByteArraySink : public Sink {
     public:
      explicit UncheckedByteArraySink(char* dest) : dest_(dest) { }
      ~UncheckedByteArraySink() override;
      void Append(const char* data, size_t n) override;
      char* GetAppendBuffer(size_t len, char* scratch) override;
      char* GetAppendBufferVariable(
          size_t min_size, size_t desired_size_hint, char* scratch,
          size_t scratch_size, size_t* allocated_size) override;
      void AppendAndTakeOwnership(
          char* bytes, size_t n, void (*deleter)(void*, const char*, size_t),
          void *deleter_arg) override;

      // Return the current output pointer so that a caller can see how
      // many bytes were produced.
      // Note: this is not a Sink method.
      char* CurrentDestination() const { return dest_; }
     private:
      char* dest_;
    };
    • Append: writes the bytes bytes[0, n-1] to the sink.
    • GetAppendBuffer: hands out a writable buffer of the given length, which must remain valid until Append is called; an implementation may also simply return scratch (memory allocated by the caller).
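    For the unchecked sink, both methods are similarly thin; a sketch
    consistent with the class above (note how GetAppendBuffer hands out the
    destination itself, so the following Append usually degenerates to just
    advancing the pointer):

    void UncheckedByteArraySink::Append(const char* data, size_t n) {
      // Do no copying if the caller filled in the result of GetAppendBuffer()
      if (data != dest_) {
        std::memcpy(dest_, data, n);
      }
      dest_ += n;
    }

    char* UncheckedByteArraySink::GetAppendBuffer(size_t len, char* /*scratch*/) {
      return dest_;  // write straight into the destination; no bounds checks
    }
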
  • Compress

    size_t Compress(Source* reader, Sink* writer) {
      size_t written = 0;
      size_t N = reader->Available();
      const size_t uncompressed_size = N;
      char ulength[Varint::kMax32];
      char* p = Varint::Encode32(ulength, N);
      writer->Append(ulength, p - ulength);
      written += (p - ulength);

      internal::WorkingMemory wmem(N);

      while (N > 0) {
        // Get next block to compress (without copying if possible)
        size_t fragment_size;
        const char* fragment = reader->Peek(&fragment_size);
        assert(fragment_size != 0);  // premature end of input
        const size_t num_to_read = std::min(N, kBlockSize);
        size_t bytes_read = fragment_size;

        size_t pending_advance = 0;
        if (bytes_read >= num_to_read) {
          // Buffer returned by reader is large enough
          pending_advance = num_to_read;
          fragment_size = num_to_read;
        } else {
          char* scratch = wmem.GetScratchInput();
          std::memcpy(scratch, fragment, bytes_read);
          reader->Skip(bytes_read);

          while (bytes_read < num_to_read) {
            fragment = reader->Peek(&fragment_size);
            size_t n = std::min<size_t>(fragment_size, num_to_read - bytes_read);
            std::memcpy(scratch + bytes_read, fragment, n);
            bytes_read += n;
            reader->Skip(n);
          }
          assert(bytes_read == num_to_read);
          fragment = scratch;
          fragment_size = num_to_read;
        }
        assert(fragment_size == num_to_read);

        // Get encoding table for compression
        int table_size;
        uint16_t* table = wmem.GetHashTable(num_to_read, &table_size);

        // Compress input_fragment and append to dest
        const int max_output = MaxCompressedLength(num_to_read);

        // Need a scratch buffer for the output, in case the byte sink doesn't
        // have room for us directly.

        // Since we encode kBlockSize regions followed by a region
        // which is <= kBlockSize in length, a previously allocated
        // scratch_output[] region is big enough for this iteration.
        char* dest = writer->GetAppendBuffer(max_output, wmem.GetScratchOutput());
        char* end = internal::CompressFragment(fragment, fragment_size, dest, table,
                                               table_size);
        writer->Append(dest, end - dest);
        written += (end - dest);

        N -= num_to_read;
        reader->Skip(pending_advance);
      }

      Report("snappy_compress", written, uncompressed_size);

      return written;
    }
    • The header is the uncompressed length, encoded as a varint (Encode32):

      inline char* Varint::Encode32(char* sptr, uint32_t v) {
        // Operate on characters as unsigneds
        uint8_t* ptr = reinterpret_cast<uint8_t*>(sptr);
        static const uint8_t B = 128;
        if (v < (1 << 7)) {
          *(ptr++) = static_cast<uint8_t>(v);
        } else if (v < (1 << 14)) {
          *(ptr++) = static_cast<uint8_t>(v | B);
          *(ptr++) = static_cast<uint8_t>(v >> 7);
        } else if (v < (1 << 21)) {
          *(ptr++) = static_cast<uint8_t>(v | B);
          *(ptr++) = static_cast<uint8_t>((v >> 7) | B);
          *(ptr++) = static_cast<uint8_t>(v >> 14);
        } else if (v < (1 << 28)) {
          *(ptr++) = static_cast<uint8_t>(v | B);
          *(ptr++) = static_cast<uint8_t>((v >> 7) | B);
          *(ptr++) = static_cast<uint8_t>((v >> 14) | B);
          *(ptr++) = static_cast<uint8_t>(v >> 21);
        } else {
          *(ptr++) = static_cast<uint8_t>(v | B);
          *(ptr++) = static_cast<uint8_t>((v >> 7) | B);
          *(ptr++) = static_cast<uint8_t>((v >> 14) | B);
          *(ptr++) = static_cast<uint8_t>((v >> 21) | B);
          *(ptr++) = static_cast<uint8_t>(v >> 28);
        }
        return reinterpret_cast<char*>(ptr);
      }
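
      For the decoding direction, a generic varint32 parser can be sketched as
      follows (snappy itself reads the preamble via
      SnappyDecompressor::ReadUncompressedLength, which applies the same rule):

      // Sketch: parse a little-endian varint32 from [p, limit). Returns the
      // position just past the varint, or nullptr if it is truncated/too long.
      const char* ParseVarint32(const char* p, const char* limit, uint32_t* v) {
        uint32_t result = 0;
        for (uint32_t shift = 0; shift <= 28 && p < limit; shift += 7) {
          uint32_t byte = static_cast<uint8_t>(*p++);
          result |= (byte & 127) << shift;
          if ((byte & 128) == 0) {  // high bit clear: this was the last byte
            *v = result;
            return p;
          }
        }
        return nullptr;
      }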
    • Obtain the next fragment to compress, together with its fragment_size.

    • Call CompressFragment:

      // Flat array compression that does not emit the "uncompressed length"
      // prefix. Compresses "input" string to the "*op" buffer.
      //
      // REQUIRES: "input" is at most "kBlockSize" bytes long.
      // REQUIRES: "op" points to an array of memory that is at least
      // "MaxCompressedLength(input.size())" in size.
      // REQUIRES: All elements in "table[0..table_size-1]" are initialized to zero.
      // REQUIRES: "table_size" is a power of two
      //
      // Returns an "end" pointer into "op" buffer.
      // "end - op" is the compressed size of "input".
      namespace internal {
      char* CompressFragment(const char* input, size_t input_size, char* op,
                             uint16_t* table, const int table_size) {
        // "ip" is the input pointer, and "op" is the output pointer.
        const char* ip = input;
        assert(input_size <= kBlockSize);
        assert((table_size & (table_size - 1)) == 0);  // table must be power of two
        const uint32_t mask = table_size - 1;
        const char* ip_end = input + input_size;
        const char* base_ip = ip;

        const size_t kInputMarginBytes = 15;
        if (SNAPPY_PREDICT_TRUE(input_size >= kInputMarginBytes)) {
          const char* ip_limit = input + input_size - kInputMarginBytes;

          for (uint32_t preload = LittleEndian::Load32(ip + 1);;) {
            // Bytes in [next_emit, ip) will be emitted as literal bytes. Or
            // [next_emit, ip_end) after the main loop.
            const char* next_emit = ip++;
            uint64_t data = LittleEndian::Load64(ip);
            // The body of this loop calls EmitLiteral once and then EmitCopy one or
            // more times. (The exception is that when we're close to exhausting
            // the input we goto emit_remainder.)
            //
            // In the first iteration of this loop we're just starting, so
            // there's nothing to copy, so calling EmitLiteral once is
            // necessary. And we only start a new iteration when the
            // current iteration has determined that a call to EmitLiteral will
            // precede the next call to EmitCopy (if any).
            //
            // Step 1: Scan forward in the input looking for a 4-byte-long match.
            // If we get close to exhausting the input then goto emit_remainder.
            //
            // Heuristic match skipping: If 32 bytes are scanned with no matches
            // found, start looking only at every other byte. If 32 more bytes are
            // scanned (or skipped), look at every third byte, etc.. When a match is
            // found, immediately go back to looking at every byte. This is a small
            // loss (~5% performance, ~0.1% density) for compressible data due to more
            // bookkeeping, but for non-compressible data (such as JPEG) it's a huge
            // win since the compressor quickly "realizes" the data is incompressible
            // and doesn't bother looking for matches everywhere.
            //
            // The "skip" variable keeps track of how many bytes there are since the
            // last match; dividing it by 32 (ie. right-shifting by five) gives the
            // number of bytes to move ahead for each iteration.
            uint32_t skip = 32;

            const char* candidate;
            if (ip_limit - ip >= 16) {
              auto delta = ip - base_ip;
              for (int j = 0; j < 4; ++j) {
                for (int k = 0; k < 4; ++k) {
                  int i = 4 * j + k;
                  // These for-loops are meant to be unrolled. So we can freely
                  // special case the first iteration to use the value already
                  // loaded in preload.
                  uint32_t dword = i == 0 ? preload : static_cast<uint32_t>(data);
                  assert(dword == LittleEndian::Load32(ip + i));
                  uint32_t hash = HashBytes(dword, mask);
                  candidate = base_ip + table[hash];
                  assert(candidate >= base_ip);
                  assert(candidate < ip + i);
                  table[hash] = delta + i;
                  if (SNAPPY_PREDICT_FALSE(LittleEndian::Load32(candidate) == dword)) {
                    *op = LITERAL | (i << 2);
                    UnalignedCopy128(next_emit, op + 1);
                    ip += i;
                    op = op + i + 2;
                    goto emit_match;
                  }
                  data >>= 8;
                }
                data = LittleEndian::Load64(ip + 4 * j + 4);
              }
              ip += 16;
              skip += 16;
            }
            while (true) {
              assert(static_cast<uint32_t>(data) == LittleEndian::Load32(ip));
              uint32_t hash = HashBytes(data, mask);
              uint32_t bytes_between_hash_lookups = skip >> 5;
              skip += bytes_between_hash_lookups;
              const char* next_ip = ip + bytes_between_hash_lookups;
              if (SNAPPY_PREDICT_FALSE(next_ip > ip_limit)) {
                ip = next_emit;
                goto emit_remainder;
              }
              candidate = base_ip + table[hash];
              assert(candidate >= base_ip);
              assert(candidate < ip);

              table[hash] = ip - base_ip;
              if (SNAPPY_PREDICT_FALSE(static_cast<uint32_t>(data) ==
                                       LittleEndian::Load32(candidate))) {
                break;
              }
              data = LittleEndian::Load32(next_ip);
              ip = next_ip;
            }

            // Step 2: A 4-byte match has been found. We'll later see if more
            // than 4 bytes match. But, prior to the match, input
            // bytes [next_emit, ip) are unmatched. Emit them as "literal bytes."
            assert(next_emit + 16 <= ip_end);
            op = EmitLiteral</*allow_fast_path=*/true>(op, next_emit, ip - next_emit);

            // Step 3: Call EmitCopy, and then see if another EmitCopy could
            // be our next move. Repeat until we find no match for the
            // input immediately after what was consumed by the last EmitCopy call.
            //
            // If we exit this loop normally then we need to call EmitLiteral next,
            // though we don't yet know how big the literal will be. We handle that
            // by proceeding to the next iteration of the main loop. We also can exit
            // this loop via goto if we get close to exhausting the input.
          emit_match:
            do {
              // We have a 4-byte match at ip, and no need to emit any
              // "literal bytes" prior to ip.
              const char* base = ip;
              std::pair<size_t, bool> p =
                  FindMatchLength(candidate + 4, ip + 4, ip_end, &data);
              size_t matched = 4 + p.first;
              ip += matched;
              size_t offset = base - candidate;
              assert(0 == memcmp(base, candidate, matched));
              if (p.second) {
                op = EmitCopy</*len_less_than_12=*/true>(op, offset, matched);
              } else {
                op = EmitCopy</*len_less_than_12=*/false>(op, offset, matched);
              }
              if (SNAPPY_PREDICT_FALSE(ip >= ip_limit)) {
                goto emit_remainder;
              }
              // Expect 5 bytes to match
              assert((data & 0xFFFFFFFFFF) ==
                     (LittleEndian::Load64(ip) & 0xFFFFFFFFFF));
              // We are now looking for a 4-byte match again. We read
              // table[Hash(ip, shift)] for that. To improve compression,
              // we also update table[Hash(ip - 1, mask)] and table[Hash(ip, mask)].
              table[HashBytes(LittleEndian::Load32(ip - 1), mask)] = ip - base_ip - 1;
              uint32_t hash = HashBytes(data, mask);
              candidate = base_ip + table[hash];
              table[hash] = ip - base_ip;
              // Measurements on the benchmarks have shown the following probabilities
              // for the loop to exit (ie. avg. number of iterations is reciprocal).
              // BM_Flat/6 txt1 p = 0.3-0.4
              // BM_Flat/7 txt2 p = 0.35
              // BM_Flat/8 txt3 p = 0.3-0.4
              // BM_Flat/9 txt3 p = 0.34-0.4
              // BM_Flat/10 pb p = 0.4
              // BM_Flat/11 gaviota p = 0.1
              // BM_Flat/12 cp p = 0.5
              // BM_Flat/13 c p = 0.3
            } while (static_cast<uint32_t>(data) == LittleEndian::Load32(candidate));
            // Because the least significant 5 bytes matched, we can utilize data
            // for the next iteration.
            preload = data >> 8;
          }
        }

      emit_remainder:
        // Emit the remaining bytes as a literal
        if (ip < ip_end) {
          op = EmitLiteral</*allow_fast_path=*/false>(op, ip, ip_end - ip);
        }

        return op;
      }
      }  // end namespace internal
      • The core is the big loop controlled by for (uint32_t preload = LittleEndian::Load32(ip + 1);;).

      • j and k drive two nested (deliberately unrollable) for loops; the scan position advances one byte per step (the inner loop increments k and shifts data right by 8 bits). For dword, the 4 bytes at the current position, the position is recorded in the hash table.

      • If Load32(candidate) == dword occurs inside these loops, the bytes in [next_emit, ip + i) are written to op as a literal (UnalignedCopy128 unconditionally copies 16 bytes, which the 15-byte input margin makes safe, but only i + 1 of them count), and control jumps to emit_match:

        if (SNAPPY_PREDICT_FALSE(LittleEndian::Load32(candidate) == dword)) {
          *op = LITERAL | (i << 2);
          UnalignedCopy128(next_emit, op + 1);
          ip += i;
          op = op + i + 2;
          goto emit_match;
        }
      • Otherwise, fall through to the while loop below.

        while (true) {
          assert(static_cast<uint32_t>(data) == LittleEndian::Load32(ip));
          uint32_t hash = HashBytes(data, mask);
          uint32_t bytes_between_hash_lookups = skip >> 5;
          skip += bytes_between_hash_lookups;
          const char* next_ip = ip + bytes_between_hash_lookups;
          if (SNAPPY_PREDICT_FALSE(next_ip > ip_limit)) {
            ip = next_emit;
            goto emit_remainder;
          }
          candidate = base_ip + table[hash];
          assert(candidate >= base_ip);
          assert(candidate < ip);

          table[hash] = ip - base_ip;
          if (SNAPPY_PREDICT_FALSE(static_cast<uint32_t>(data) ==
                                   LittleEndian::Load32(candidate))) {
            break;
          }
          data = LittleEndian::Load32(next_ip);
          ip = next_ip;
        }

        This is the heuristic match skipping mentioned in the comment: skip >> 5 is the probe step. Within the first 32 probes every byte is checked, then every second byte, then every third, and so on; bytes_between_hash_lookups is simply how many bytes to advance per check. The loop ends in one of two ways: either next_ip runs past ip_limit, in which case the remainder is emitted as a literal, or data equals candidate, in which case we break out of the loop.
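
        To see the schedule concretely, here is a tiny standalone sketch (not
        snappy code) that replays the skip arithmetic:

        #include <cstdint>
        #include <cstdio>

        // step = skip >> 5: the first 32 probes advance 1 byte each, the next
        // 16 probes 2 bytes each, then 3 bytes, and so on.
        int main() {
          uint32_t skip = 32;
          uint32_t pos = 0;
          for (int probe = 1; probe <= 96; ++probe) {
            uint32_t bytes_between_hash_lookups = skip >> 5;
            skip += bytes_between_hash_lookups;
            pos += bytes_between_hash_lookups;
            std::printf("probe %2d: step %u -> offset %u\n", probe,
                        bytes_between_hash_lookups, pos);
          }
          return 0;
        }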

      • After the while loop ends we have a 4-byte match; the unmatched bytes before it are first written to op as a literal.

        assert(next_emit + 16 <= ip_end);
        op = EmitLiteral</*allow_fast_path=*/true>(op, next_emit, ip - next_emit);
      • Then execution enters the section labeled emit_match.

        emit_match:
          do {
            // We have a 4-byte match at ip, and no need to emit any
            // "literal bytes" prior to ip.
            const char* base = ip;
            std::pair<size_t, bool> p =
                FindMatchLength(candidate + 4, ip + 4, ip_end, &data);
            size_t matched = 4 + p.first;
            ip += matched;
            size_t offset = base - candidate;
            assert(0 == memcmp(base, candidate, matched));
            if (p.second) {
              op = EmitCopy</*len_less_than_12=*/true>(op, offset, matched);
            } else {
              op = EmitCopy</*len_less_than_12=*/false>(op, offset, matched);
            }
            if (SNAPPY_PREDICT_FALSE(ip >= ip_limit)) {
              goto emit_remainder;
            }
            // Expect 5 bytes to match
            assert((data & 0xFFFFFFFFFF) ==
                   (LittleEndian::Load64(ip) & 0xFFFFFFFFFF));
            // We are now looking for a 4-byte match again. We read
            // table[Hash(ip, shift)] for that. To improve compression,
            // we also update table[Hash(ip - 1, mask)] and table[Hash(ip, mask)].
            table[HashBytes(LittleEndian::Load32(ip - 1), mask)] = ip - base_ip - 1;
            uint32_t hash = HashBytes(data, mask);
            candidate = base_ip + table[hash];
            table[hash] = ip - base_ip;
            // Measurements on the benchmarks have shown the following probabilities
            // for the loop to exit (ie. avg. number of iterations is reciprocal).
            // BM_Flat/6 txt1 p = 0.3-0.4
            // BM_Flat/7 txt2 p = 0.35
            // BM_Flat/8 txt3 p = 0.3-0.4
            // BM_Flat/9 txt3 p = 0.34-0.4
            // BM_Flat/10 pb p = 0.4
            // BM_Flat/11 gaviota p = 0.1
            // BM_Flat/12 cp p = 0.5
            // BM_Flat/13 c p = 0.3
          } while (static_cast<uint32_t>(data) == LittleEndian::Load32(candidate));
          // Because the least significant 5 bytes matched, we can utilize data
          // for the next iteration.
          preload = data >> 8;

        FindMatchLength computes the maximum match length; the offset and matched length are then encoded into op via EmitCopy, and finally the hash table is updated. When data no longer equals candidate, the loop exits.
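
        On the encoding side, EmitCopy follows the tag layout from the format
        description. The real EmitCopy is heavily branch-optimized and splits
        long matches into several elements; a plain, unoptimized sketch of the
        two short forms looks like this:

        #include <cassert>
        #include <cstddef>

        constexpr int COPY_1_BYTE_OFFSET = 1;
        constexpr int COPY_2_BYTE_OFFSET = 2;

        // Unoptimized sketch: emit one copy element for a match of `len` bytes
        // at distance `offset`, choosing the shortest form that fits.
        char* EmitCopySketch(char* op, size_t offset, size_t len) {
          if (len >= 4 && len <= 11 && offset < 2048) {
            // 1-byte-offset form: (len-4) in bits [2..4] of the tag, the top
            // three offset bits in [5..7], the low offset byte after the tag.
            *op++ = static_cast<char>(COPY_1_BYTE_OFFSET | ((len - 4) << 2) |
                                      ((offset >> 8) << 5));
            *op++ = static_cast<char>(offset & 0xff);
          } else {
            // 2-byte-offset form: (len-1) in the upper six bits of the tag,
            // then the offset as a little-endian 16-bit integer.
            assert(len >= 1 && len <= 64 && offset < 65536);
            *op++ = static_cast<char>(COPY_2_BYTE_OFFSET | ((len - 1) << 2));
            *op++ = static_cast<char>(offset & 0xff);
            *op++ = static_cast<char>((offset >> 8) & 0xff);
          }
          return op;
        }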

    • After CompressFragment returns, control comes back to Compress, which writes the result to the writer via writer->Append(dest, end - dest).

Uncompress

  • RawUncompress

    bool RawUncompress(const char* compressed, size_t compressed_length,
                       char* uncompressed) {
      ByteArraySource reader(compressed, compressed_length);
      return RawUncompress(&reader, uncompressed);
    }

    Construct a ByteArraySource and call the overloaded RawUncompress with the reader:

    bool RawUncompress(Source* compressed, char* uncompressed) {
      SnappyArrayWriter output(uncompressed);
      return InternalUncompress(compressed, &output);
    }

    Construct a SnappyArrayWriter and call InternalUncompress with the output:

    template <typename Writer>
    static bool InternalUncompress(Source* r, Writer* writer) {
      // Read the uncompressed length from the front of the compressed input
      SnappyDecompressor decompressor(r);
      uint32_t uncompressed_len = 0;
      if (!decompressor.ReadUncompressedLength(&uncompressed_len)) return false;

      return InternalUncompressAllTags(&decompressor, writer, r->Available(),
                                       uncompressed_len);
    }

    Build a decompressor from the source, read uncompressed_len from the preamble, and call InternalUncompressAllTags:

    template <typename Writer>
    static bool InternalUncompressAllTags(SnappyDecompressor* decompressor,
                                          Writer* writer, uint32_t compressed_len,
                                          uint32_t uncompressed_len) {
      Report("snappy_uncompress", compressed_len, uncompressed_len);

      writer->SetExpectedLength(uncompressed_len);

      // Process the entire input
      decompressor->DecompressAllTags(writer);
      writer->Flush();
      return (decompressor->eof() && writer->CheckLength());
    }

    Set the expected uncompressed_len on the writer, then decompress the whole input with decompressor->DecompressAllTags(writer):

    void DecompressAllTags(Writer* writer) {
      const char* ip = ip_;
      ResetLimit(ip);
      auto op = writer->GetOutputPtr();
      // We could have put this refill fragment only at the beginning of the loop.
      // However, duplicating it at the end of each branch gives the compiler more
      // scope to optimize the <ip_limit_ - ip> expression based on the local
      // context, which overall increases speed.
    #define MAYBE_REFILL()                                         \
      if (SNAPPY_PREDICT_FALSE(ip >= ip_limit_min_maxtaglen_)) {   \
        ip_ = ip;                                                  \
        if (SNAPPY_PREDICT_FALSE(!RefillTag())) goto exit;         \
        ip = ip_;                                                  \
        ResetLimit(ip);                                            \
      }                                                            \
      preload = static_cast<uint8_t>(*ip)

      // At the start of the for loop below the least significant byte of preload
      // contains the tag.
      uint32_t preload;
      MAYBE_REFILL();
      for (;;) {
        {
          ptrdiff_t op_limit_min_slop;
          auto op_base = writer->GetBase(&op_limit_min_slop);
          if (op_base) {
            auto res =
                DecompressBranchless(reinterpret_cast<const uint8_t*>(ip),
                                     reinterpret_cast<const uint8_t*>(ip_limit_),
                                     op - op_base, op_base, op_limit_min_slop);
            ip = reinterpret_cast<const char*>(res.first);
            op = op_base + res.second;
            MAYBE_REFILL();
          }
        }
        const uint8_t c = static_cast<uint8_t>(preload);
        ip++;

        // Ratio of iterations that have LITERAL vs non-LITERAL for different
        // inputs.
        //
        // input          LITERAL  NON_LITERAL
        // -----------------------------------
        // html|html4|cp  23%      77%
        // urls           36%      64%
        // jpg            47%      53%
        // pdf            19%      81%
        // txt[1-4]       25%      75%
        // pb             24%      76%
        // bin            24%      76%
        if (SNAPPY_PREDICT_FALSE((c & 0x3) == LITERAL)) {
          size_t literal_length = (c >> 2) + 1u;
          if (writer->TryFastAppend(ip, ip_limit_ - ip, literal_length, &op)) {
            assert(literal_length < 61);
            ip += literal_length;
            // NOTE: There is no MAYBE_REFILL() here, as TryFastAppend()
            // will not return true unless there's already at least five spare
            // bytes in addition to the literal.
            preload = static_cast<uint8_t>(*ip);
            continue;
          }
          if (SNAPPY_PREDICT_FALSE(literal_length >= 61)) {
            // Long literal.
            const size_t literal_length_length = literal_length - 60;
            literal_length =
                ExtractLowBytes(LittleEndian::Load32(ip), literal_length_length) +
                1;
            ip += literal_length_length;
          }

          size_t avail = ip_limit_ - ip;
          while (avail < literal_length) {
            if (!writer->Append(ip, avail, &op)) goto exit;
            literal_length -= avail;
            reader_->Skip(peeked_);
            size_t n;
            ip = reader_->Peek(&n);
            avail = n;
            peeked_ = avail;
            if (avail == 0) goto exit;
            ip_limit_ = ip + avail;
            ResetLimit(ip);
          }
          if (!writer->Append(ip, literal_length, &op)) goto exit;
          ip += literal_length;
          MAYBE_REFILL();
        } else {
          if (SNAPPY_PREDICT_FALSE((c & 3) == COPY_4_BYTE_OFFSET)) {
            const size_t copy_offset = LittleEndian::Load32(ip);
            const size_t length = (c >> 2) + 1;
            ip += 4;

            if (!writer->AppendFromSelf(copy_offset, length, &op)) goto exit;
          } else {
            const ptrdiff_t entry = kLengthMinusOffset[c];
            preload = LittleEndian::Load32(ip);
            const uint32_t trailer = ExtractLowBytes(preload, c & 3);
            const uint32_t length = entry & 0xff;
            assert(length > 0);

            // copy_offset/256 is encoded in bits 8..10. By just fetching
            // those bits, we get copy_offset (since the bit-field starts at
            // bit 8).
            const uint32_t copy_offset = trailer - entry + length;
            if (!writer->AppendFromSelf(copy_offset, length, &op)) goto exit;

            ip += (c & 3);
            // By using the result of the previous load we reduce the critical
            // dependency chain of ip to 4 cycles.
            preload >>= (c & 3) * 8;
            if (ip < ip_limit_min_maxtaglen_) continue;
          }
          MAYBE_REFILL();
        }
      }
    #undef MAYBE_REFILL
    exit:
      writer->SetOutputPtr(op);
    }

Pseudocode

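Putting the walkthrough together, the two paths condense to the pseudocode
below. This is a summary sketch of the code above; fast-path machinery such as
preload, the unrolled 16-byte scan, and DecompressBranchless is omitted.

Compression:

    Compress(reader, writer):
        N = reader.Available()
        emit Varint32(N)                          # preamble
        for each input block of at most kBlockSize (64 KiB) bytes:
            table = small hash table mapping 4-byte values to block offsets
            next_emit = ip = block start
            loop:
                scan forward from ip for a position whose 4 bytes equal those
                at an earlier position recorded in table, skipping more
                aggressively the longer the scan goes without a match
                if the scan gets too close to the block end: break
                EmitLiteral(bytes in [next_emit, ip))
                do:
                    matched = 4 + FindMatchLength(candidate + 4, ip + 4)
                    EmitCopy(offset = ip - candidate, length = matched)
                    ip += matched; update table around ip
                while the 4 bytes at ip again match their candidate
                next_emit = ip
            EmitLiteral(bytes in [next_emit, block end))  # remainder

Decompression:

    Uncompress(reader, writer):
        N = ReadVarint32()                        # expected output length
        while input remains:
            tag = next byte
            switch (tag & 3):
                case 00 (literal):
                    len = (tag >> 2) + 1 if (tag >> 2) < 60, else the next
                    ((tag >> 2) - 59) bytes hold (len - 1), little-endian;
                    copy the next len input bytes to the output
                case 01 (copy, 1-byte offset):
                    len = ((tag >> 2) & 7) + 4
                    offset = ((tag >> 5) << 8) | next byte
                case 10 (copy, 2-byte offset):
                    len = (tag >> 2) + 1; offset = next 2 bytes (LE)
                case 11 (copy, 4-byte offset):
                    len = (tag >> 2) + 1; offset = next 4 bytes (LE)
            for copies: append len bytes starting offset bytes back in the
            already-produced output (overlap allowed, giving RLE)
        succeed iff exactly N output bytes were produced and the input is
        exhausted
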
Performance

Snappy is intended to be fast. On a single core of a Core i7 processor in 64-bit mode, it compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. (These numbers are for the slowest inputs in our benchmark suite; others are much faster.) In our tests, Snappy usually is faster than algorithms in the same class (e.g. LZO, LZF, QuickLZ, etc.) while achieving comparable compression ratios.

Typical compression ratios (based on the benchmark suite) are about 1.5-1.7x for plain text, about 2-4x for HTML, and of course 1.0x for JPEGs, PNGs and other already-compressed data. Similar numbers for zlib in its fastest mode are 2.6-2.8x, 3-7x and 1.0x, respectively. More sophisticated algorithms are capable of achieving yet higher compression rates, although usually at the expense of speed. Of course, compression ratio will vary significantly with the input.

Although Snappy should be fairly portable, it is primarily optimized for 64-bit x86-compatible processors, and may run slower in other environments.
In particular:

  • Snappy uses 64-bit operations in several places to process more data at once than would otherwise be possible.
  • Snappy assumes unaligned 32 and 64-bit loads and stores are cheap. On some platforms, these must be emulated with single-byte loads and stores, which is much slower.
  • Snappy assumes little-endian throughout, and needs to byte-swap data in several places if running on a big-endian platform.

Experience has shown that even heavily tuned code can be improved. Performance optimizations, whether for 64-bit x86 or other platforms, are of course most welcome; see “Contact”, below.

Author: Yiheng Tong
Posted on 2021-10-18 · Updated on 2022-11-14
