对我来说,使用 UTF-8 作为目标来压缩字符串似乎不合理......看起来只是在寻找麻烦。我认为如果在线尺寸很重要,最好放弃一些压缩并使用普通的 7 位 ASCII 作为目标。
如果存储限制基于 UTF-16 字符,那么如果您关心转义或 UTF-16 合规性,则可以寻找一个大的安全子集,或者您可以尝试将每个字符用作 0..65535,如果涉及其他所有内容(例如数据库)没有问题。大多数软件层应该没有问题(ab)使用,但请注意,在 UTF-16 范围内 0xD800-0xDFFF 保留用于特殊用途(代理对),因此某些组合是正式的“编码错误”,理论上可以停止或扭曲。
在我为了好玩而编写的玩具4 KB JavaScript 演示中,我对压缩结果使用了一种编码,该编码将四个二进制字节存储到五个字符中,这些字符是从 85 个字符的 ASCII 子集中选择的,这些字符很干净,可以嵌入 JavaScript 字符串 (85^5略大于 8^4,但仍符合 JavaScript 整数的精度)。这使得压缩数据安全,例如对于JSON,无需任何转义。
在代码中,以下内容构建了 85 个“安全”字符的列表:
let cset = "";
for (let i=35; i<35+85+1; i++) {
if (i !== 92) cset += String.fromCharCode(i);
}
然后将 4 个字节 ( b0
, b1
,b2
和b3
每个从 0...255) 编码为 5 个字符,代码为:
// First convert to 0...4294967295
let x = ((b0*256 + b1)*256 + b2)*256 + b3;
// Then convert to base 85
let result = "";
for (let i=0; i<5; i++) {
let x2 = Math.floor(x / 85);
result += cset[x - x2*85];
x = x2;
}
要解码你做相反的事情,即从 base-85 数字计算 x,然后提取 4 个 base-256 数字(即字节)。
注意:在环面代码中我使用了一个稍微不同的字符集,而不是跳过 92\
我用 126 替换它~
。对于有兴趣的人,完整的解压代码是
// There are two Huffman-encoded code streams
// T - single chars (0..127) and sequence lengths (128...255)
// A - high bits of relative addresses of sequence (0..255)
//
// Expansion algorithm is:
// 1) Read a code X from T
// 2) If it's a char (X < 128) then add to output
// 3) otherwise (X>=128) read sequence address ADDR from stream A (high bits)
// and from input (low bits) and copy X-128 bytes from ADDR bytes "ago"
//
let Z = 5831; // expanded size
let i = 0, // source ptr
a = 0, // current bits accumulator
n = 0; // number of available bits in a
// Read a single bit
let b = function(){
if (!n) {
// There are no more bits available in the accumulator, read a new chunk:
// 5 ASCII escape-safe chars will be transformed in 4 8-bit binary bytes
// (like BASE64, just a bit more dense)
a = 0;
let w = 5;
while (w--) {
let y = s.charCodeAt(i+w); // get next char
a = a*85 + (y > 125 ? 92 : y) - 35; // extract base-85 "digit" (note, uses ~ instead of \ that requires quoting)
}
n = 32; // we got 32 bits in a
i += 5; // we consumed 5 characters from source
}
return (a >> --n) & 1; // extract a single bit
};
// Read a code of z bits by concatenating bits coming from b()
let v = function(z){
return (--z ? v(z) : 0)*2+b();
};
// Read an Huffman (sub-)tree: a bit will tell if we need to
// read a two sub-trees or a leaf
let h = function(){
return b() ? [h(), h()] : v(8);
};
// Read A and T Huffman trees
let A = h(), T = h();
// Extract a code given a node:
// if the node is an array (intermediate node) then we need to read a bit
// from the input binary stream to decide which way to go down the tree,
// if it's a number then we just return the value.
// `n.map` is truthy for arrays and falsy for numbers.
let d = function(n){
return n.map ? d(n[b()]) : n;
};
let S=""; // Output
// While we're not done
while(S.length<Z){
// Extract a code from T
x = d(T);
if (x < 128) {
// This is a single character, copy to output
S += String.fromCharCode(x);
} else {
// This is a sequence of x-128 bytes, get address and copy it
// Note: high 8 bits are from the Huffman tree A and 8 low bits
// are instead directly form the bit stream as they're basically
// noise and there's nothing to gain by trying to compress them.
S += S.substr(S.length-(d(A)<<8)-v(8), x-128)
};
}
(请注意,我没有测试这个重新格式化/评论的版本,可能存在拼写错误)