Optimizing Adler-32 with RVV: A Deep Dive into Chromium's Latest Patch
The new patch introduces several optimizations to improve the efficiency of the Adler-32 checksum calculation.
Zero Padding to Ensure Multiple of vl
Handling remaining data can be avoided by padding zeros to make the length of the array a multiple of vl. The Adler-32 value depends on the accumulation of the data, so padding with zeros does not impact the result.
size_t head = len & (vl - 1);
if (head > 0) {
vuint8m2_t zero8 = __riscv_vmv_v_x_u8m2(0, vl);
vuint8m2_t in = __riscv_vle8_v_u8m2(buf, vl);
in = __riscv_vslideup(zero8, in, vl - head, vl);
vuint16m4_t in16 = __riscv_vwcvtu_x(in, vl);
a_sum = in16;
buf += head;
}
Using Two 16-Bit Accumulators Instead of 32-Bit & 16-Bit Accumulators
Using a combination of a 32-bit accumulator and a 16-bit accumulator in the inner loop is inefficient. Instead, two 16-bit accumulators can be used.
Calculating the B Part with the Trapezoidal Rule
The accumulation of the B part can be calculated using the Trapezoidal rule. Assume we accumulate x times and the input buffer is all 255 in the worst case:
\(255 * ((1 + x) * x) / 2 <= 65535\)
and
\(255 * ((1 + 22) * 22) / 2 < 65535 < 255 * ((1 + 23) * 23) / 2\).
Therefore, the max value of x will be 22.
int batch = iters < 22 ? iters : 22;
// ...
while (batch-- > 0) {
vuint8m2_t in8 = __riscv_vle8_v_u8m2(buf, vl);
buf += vl;
b_batch = __riscv_vadd(b_batch, a_batch, vl);
a_batch = __riscv_vwaddu_wv(a_batch, in8, vl);
}
Efficient Modulo Handling
The patch introduces a more efficient way to handle overflow and perform modulo operations. Instead of performing modulo after each iteration, it checks if a_sum + a_batch would overflow and adjusts a_sum accordingly.
const vuint16m4_t a_overflow = __riscv_vrsub(a_sum, BASE, vl);
// ...
// Check if `a_sum + a_batch >= BASE`
vbool4_t ov = __riscv_vmsgeu(a_batch, a_overflow, vl);
a_sum = __riscv_vadd(a_sum, a_batch, vl);
// a_sum + (65536 - BASE) => a_sum - BASE in 16-bit context
a_sum = __riscv_vadd_mu(ov, a_sum, a_sum, 65536 - BASE, vl);
Overflow Handling Mechanism for b_sum
b_sum is a 32-bit accumulator, so we don’t need to prevent overflow at each iteration. Instead, we handle overflow when b_sum is close to exceeding \(2^{32}-1\).
When b_sum >= 65535 * BASE, we have a risk to overflow. We have to prevent overflow when the inner loop is executed b_overflow times.
In worst case, b_batch would be 65535 per iterator:
Therefore, the max value of b_overflow will be \(\text{BASE} / 23\).
const int b_overflow = BASE / 23;
int fixup = b_overflow;
// ...
b_sum = __riscv_vwaddu_wv(b_sum, b_batch, vl);
if (--fixup <= 0) {
// not sure how this code work
// b_sum = b_sum - BASE * (b_sum >> 16) is not mathematically eqaul to b_sum % BASE
// it seems to serve a similar purpose in controlling calue of b_sum
b_sum = __riscv_vnmsac(b_sum, BASE, __riscv_vsrl(b_sum, 16, vl), vl);
fixup = b_overflow;
}
Summary
The recent patch significantly improves the efficiency of the Adler-32 checksum calculation in Chromium, introduced by Simon Hosie from Rivos.
-
Increased Vector Register Utilization: The new implementation uses m2 instead of m1, which increases the utilization of vector registers, leading to better performance and parallelism in the inner loop.
-
Zero Padding for Vector Length Multiplicity: By padding zeros to ensure the length of the array is a multiple of the vector length (vl), the implementation ensures optimal vector processing without leftover elements that would require additional handling.
-
Efficient Overflow Prevention and Modulo Operation: The patch introduces a more efficient method to prevent overflow and handle modulo operations. The fixed inner loop executes a maximum of 22 times, making efficient use of two 16-bit accumulators.