gh-146192: Add base32 support to binascii #146193

Open
kangtastic wants to merge 4 commits into python:main from kangtastic:base32-accel

kangtastic (Contributor) commented Mar 20, 2026

Synopsis

Add base32 encoder and decoder functions implemented in C to binascii and use them to greatly improve the performance and reduce the memory usage of the existing base32 codec functions in base64.

No API or documentation changes are necessary with respect to any functions in base64, and all existing unit tests for those functions continue to pass without modification.

Resolves: gh-146192

Discussion

The base32-related functions in base64 are now wrappers for the new functions in binascii, as envisioned in the docs:

The binascii module contains a number of methods to convert between binary and various ASCII-encoded binary representations. Normally, you will not use these functions directly but use wrapper modules like uu or base64 instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.

Comments and questions are welcome.

Benchmarks

Benchmark script

# bench_b32.py

# Note: Can be EXTREMELY SLOW on unmodified mainline CPython.

import base64
import sys
import timeit
import tracemalloc

funcs = [(base64.b64encode, base64.b64decode), # sanity check/comparison
         (base64.b32encode, base64.b32decode),
         (base64.b32hexencode, base64.b32hexdecode)]

def mb(n):
    return f"{n / 1024 / 1024:.3f}"

def stats(func, data, t, m):
    name, n, bps = func.__qualname__, len(data), len(data) / t
    print(f"{name:<16}{n:<16}{t:<11.3f}{mb(bps):<13}{mb(m)}")

if __name__ == "__main__":
    print(f"Python {sys.version}\n")
    print(f"function        processed (b)   time (s)   avg (MB/s)   mem (MB)\n")
    data = b"a" * int(sys.argv[1]) * 1024 * 1024
    for fenc, fdec in funcs:
        tracemalloc.start()
        enc = fenc(data)
        menc = tracemalloc.get_traced_memory()[1] - len(enc)
        tracemalloc.stop()
        tenc = timeit.timeit("fenc(data)", number=1, globals=globals())
        stats(fenc, data, tenc, menc)

        tracemalloc.start()
        dec = fdec(enc)
        mdec = tracemalloc.get_traced_memory()[1] - len(dec)
        tracemalloc.stop()
        tdec = timeit.timeit("fdec(enc)", number=1, globals=globals())
        stats(fdec, enc, tdec, mdec)

Unmodified mainline CPython

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/main:d357a7dbf38, Mar 19 2026, 23:22:25) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1088.370     0.000
b64decode       22369624        0.017      1264.389     0.000
b32encode       16777216        2.308      6.933        17.382
b32decode       26843552        3.389      7.553        27.787
b32hexencode    16777216        2.338      6.843        17.379
b32hexdecode    26843552        3.388      7.557        27.787

With this PR

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/base32-accel:72fd0f0302a, Mar 20 2026, 00:04:23) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1084.957     0.000
b64decode       22369624        0.016      1363.524     0.000
b32encode       16777216        0.017      967.528      0.000
b32decode       26843552        0.016      1581.002     0.000
b32hexencode    16777216        0.016      995.277      0.000
b32hexdecode    26843552        0.016      1588.353     0.000

Encoding performance is improved by ~150x, decoding performance is improved by ~200x,
and no auxiliary memory is used.
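
The figures in the two tables can be cross-checked with a little arithmetic. The sketch below is not part of the benchmark script. It recomputes the "processed (b)" values of the decoder rows from the 16 MiB input size and derives the speedups from the reported throughput. The helper name `padded_len` is made up for illustration.

```python
MIB = 1024 * 1024

def padded_len(n_bytes, group_in, group_out):
    """Padded output length when every group_in input bytes become group_out characters."""
    groups = -(-n_bytes // group_in)  # ceiling division
    return groups * group_out

# Throughput in MB/s copied from the two tables above: (mainline, this PR).
throughput = {
    "b32encode": (6.933, 967.528),
    "b32decode": (7.553, 1581.002),
    "b32hexencode": (6.843, 995.277),
    "b32hexdecode": (7.557, 1588.353),
}
speedup = {name: after / before for name, (before, after) in throughput.items()}

if __name__ == "__main__":
    print("base32 size of 16 MiB:", padded_len(16 * MIB, 5, 8))
    print("base64 size of 16 MiB:", padded_len(16 * MIB, 3, 4))
    for name, ratio in speedup.items():
        print(f"{name:<14}{ratio:.0f}x")
```

On these numbers the encoders come out at roughly 140x to 145x and the decoders at roughly 210x, in line with the rounded figures quoted above.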


📚 Documentation preview 📚: https://cpython-previews--146193.org.readthedocs.build/

serhiy-storchaka (Member) commented

You can now update your PR, @kangtastic.

kangtastic (Contributor, Author) commented

@serhiy-storchaka Already on it 😄

- Use the new `alphabet` parameter in `binascii`
- Remove `binascii.a2b_base32hex()` and `binascii.b2a_base32hex()`
- Change value for `.. versionadded::` ReST directive in docs for
  new `binascii` functions to "next" instead of "3.15"

kangtastic marked this pull request as ready for review March 20, 2026 16:03

serhiy-storchaka (Member) left a comment

I added some suggestions, but the core LGTM.

Please add assertions for new alphabets in test_constants.


.. function:: b2a_base32(data, /, *, alphabet=BASE32_ALPHABET)

Convert binary data to a line(s) of ASCII characters in base32 coding,

Member

It is a single line.

I will add wrapcol in a separate issue.


Convert base32 data back to binary and return the binary data.

Valid base32 data:

Member

This list is incomplete and redundant. I think it is better to follow the example of ascii85 and base85 (with a reference to the RFC). Mention that the mapping is case-sensitive and that no optional mapping of the digits "0" and "1" to the letters "O", "I" or "l" is used.
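
To illustrate the strictness being asked for, here is a small sketch using the existing `base64.b32decode()` wrapper, whose default behaviour is already case-sensitive and has no "0"/"1" mapping. Whether `binascii.a2b_base32()` itself behaves identically is an assumption based on this comment, so the example goes through `base64` only. The helper `try_decode` is hypothetical.

```python
import base64
import binascii

def try_decode(data, **kwargs):
    """Return the decoded bytes, or None if base64.b32decode rejects the input."""
    try:
        return base64.b32decode(data, **kwargs)
    except binascii.Error:
        return None

# Lowercase input is rejected unless casefold=True is passed.
strict_lower = try_decode(b"mzxw6ytboi======")
folded = try_decode(b"mzxw6ytboi======", casefold=True)

# The digits "0" and "1" are not in the alphabet unless map01 is given;
# map01=b"I" maps "0" to "O" and "1" to "I" before decoding.
strict_01 = try_decode(b"MZXW6YTB01======")
mapped_01 = try_decode(b"MZXW6YTB01======", map01=b"I")
```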


.. data:: BASE32_ALPHABET

The base32 alphabet according to :rfc:`4648`.

Member

Suggested change
- The base32 alphabet according to :rfc:`4648`.
+ The Base 32 alphabet according to :rfc:`4648`.


.. data:: BASE32HEX_ALPHABET

The "Extended Hex" base32hex alphabet according to :rfc:`4648`.

Member

Suggested change
- The "Extended Hex" base32hex alphabet according to :rfc:`4648`.
+ The "Extended Hex" Base 32 alphabet according to :rfc:`4648`.

These are the names used in the table 3 and 4 captions in RFC 4648.

Oh, we can even refer directly to the table:

Suggested change
- The "Extended Hex" base32hex alphabet according to :rfc:`4648`.
+ The "Extended Hex" Base 32 alphabet according to :rfc:`4648`, table 4.

Add this also for Base 64 alphabets if you choose this variant.

kangtastic (Contributor, Author) commented Mar 21, 2026

I was wondering if this would come up. RFC 4648 uses all four of the terms "Base 32", "Base32", "base 32", and "base32" to refer to this encoding at various points, but it also states e.g.:

This encoding may be referred to as "base32hex". This encoding should not be regarded as the same as the "base32" encoding and should not be referred to as only "base32".

and e.g.:

One property with this alphabet, which the base64 and base32 alphabets lack...

thus implying that "base32" and "base32hex" are preferred, even if the rest of the document doesn't adhere to the implication.

Anyway, I'll refer to it as "Base 32" in docs for now to fit what's already there, and not reference the table number or touch any Base64 stuff so as to keep the scope of this PR limited.

Lib/base64.py Outdated
Comment on lines 212 to 213
if len(s) % 8:
    raise binascii.Error('Incorrect padding')

Member

Should not this be handled in the C code?
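
Wherever the check ends up living, the behaviour visible to callers can be sketched with the public API: input whose length is not a multiple of 8 is rejected with `binascii.Error`. The helper `rejects` below is hypothetical, and the example assumes the PR preserves this behaviour, as the statement that existing tests pass unmodified suggests.

```python
import base64
import binascii

def rejects(data):
    """Return True if base64.b32decode raises binascii.Error for this input."""
    try:
        base64.b32decode(data)
    except binascii.Error:
        return True
    return False

if __name__ == "__main__":
    print(rejects(b"MZXW6YTBOI"))        # 10 characters, padding stripped
    print(rejects(b"MZXW6YTBOI======"))  # 16 characters, correctly padded
```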

_b32rev[alphabet] = {v: k for k, v in enumerate(alphabet)}

def _b32decode_prepare(s, casefold=False, map01=None):
    s = _bytes_from_decode_data(s)

Member

This is only needed if map01 is not None.

Lib/base64.py Outdated
if alphabet not in _b32rev:
    _b32rev[alphabet] = {v: k for k, v in enumerate(alphabet)}

def _b32decode_prepare(s, casefold=False, map01=None):

Member

I suggest inlining this function. map01 handling is only needed for the standard alphabet, and the code for casefold is trivial.
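
A rough sketch of the two preprocessing steps under discussion, to show how little code is involved. This is not the PR's code: the function name `normalize_b32_input` is made up, and the body only mirrors the translate-then-uppercase approach of the existing pure-Python `base64.b32decode()`.

```python
import base64
from typing import Optional

def normalize_b32_input(s: bytes, casefold: bool = False,
                        map01: Optional[bytes] = None) -> bytes:
    """Apply the optional map01 and casefold steps before strict decoding."""
    if map01 is not None:
        # Map the digit 0 to the letter O, and the digit 1 to I or L.
        s = s.translate(bytes.maketrans(b"01", b"O" + map01))
    if casefold:
        # The standard alphabet is uppercase only.
        s = s.upper()
    return s

if __name__ == "__main__":
    cleaned = normalize_b32_input(b"mzxw6ytb01======", casefold=True, map01=b"I")
    print(cleaned, base64.b32decode(cleaned))
```

Since `map01` only makes sense for the standard alphabet, the "base32hex" path would need nothing beyond the `casefold` branch.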

*
alphabet: Py_buffer(c_default="{NULL, NULL}") = BASE32_ALPHABET

base32-code line of data.

Member

Suggested change
- base32-code line of data.
+ Base32-code line of data.

- Update docs to refer to "Base 32" and "Base32"
- Update docs to better explain `binascii.a2b_base32()`
- Inline helper function in `base64`
- Add forgotten tests for presence of alphabet module globals

Successfully merging this pull request may close these issues.

C accelerator for Base32 character encoding
