GSoC 2023 Report

4 minute read

Published:

For GSoC 2023, I worked with LPython - a high performance typed Python compiler. In particular, I worked on implementing and improving advanced data structures. My mentors were Gagandeep Singh and Thirumalai Shaktivel.

I worked on several data structures, improving existing implementations, and introducing others. Weekly blogs can be found here. In this report, major contributions and learnings are summarized.

List and Tuple

The list implementation was lacking in many standard and commonly-used CPython functions, which were added. These include

  • index
  • count
  • pop
  • reverse

Some of these are overloaded, and implementations were added to make them CPython-compatible. Examples include pop and index.

Next, support for arbitrary amount of nesting involving list and tuple was added. An example:

l: list[list[tuple(i32, f64)]] = [[(1, 2.0), (3, 4.0)], [(5, 6.0)]]

Further, tuple and list comparison operators were added. For example,

assert [1, 2] > [0, 1, 2]
assert (1, 2) < (3, 1)

Wherever possible, the IntrinsicFunction architecture was used. This prevents unnecessary nodes of the form ListFunction in the ASR. We just register these functions like list.function under the umbrella of IntrinsicFunction.

Dictionary

dict (hash-map) has Linear-Probing and Separate-Chaining implementations

  • Linear-Probing is an open-addressing collision-resolution technique. All key-value pairs are stored in a single place, and key-collisions are resolved by probing for the next empty location. It is ensured that the fractional-occupancy does not cross a predetermined threshold. This is also used in the CPython implementation.
  • Separate-Chaining uses separate linked-lists for each hash-value, where resolving collisions is simply adding elements to the corresponding linked-list. As with Linear-Probing, it is ensured that the fractional-occupancy does not exceed a threshold. This is used in the C++ STL implementation of unordered_map.

Existing implementations had issues in insertion and deletion of elements. These were fixed.

Next, frequently-used functions dict.keys and dict.values were added. Unlike CPython, LPython does not support views, so lists are returned instead.

Further, support for nested dictionaries was added. These are of the form

d: dict[i32, dict[i32, i32]] = {1: {2: 4, 3: 9}, 7: {8: 64, 9: 81}}
d[3] = {4: 16, 5: 25}
d[1][10] = 100

Set

We introduced sets (hash-sets).

Similar to dictionaries, we provide two collision-resolution schemes - Linear-Probing and Separate-Chaining. An overview of these techniques was given in the previous section. Now we present some more implementation-specific details:

  • Hash computation: Similar to both CPython (set) and C++ (unordered_set), for integers it is the integer itself (modulo set capacity). For strings, we use Polynomial Rolling Hash.

  • Linear Probing
    • A single list of elements is maintained. Further, there is a list storing masks, that indicates if an element is present or not. The following mask values are used
      • 0: no element placed
      • 1: element placed without probing (i.e., hash value and position are same)
      • 2: element placed with probing
      • 3: tombstone (i.e., element was deleted, and now this position is empty)
    • Insertion and deletion routines are implemented as per the standard linear probing scheme
  • Separate Chaining
    • Here, a chain (linked-list) of elements is maintained at each hash location. For example, to access the ith linked-list
    • This linked-list has items of the form element, next, where next is a pointer to the next element in this list (or nullptr if none exists)
    • Set insertion and deletion occur via these lists
  • Benchmarking
    • Gist: Here we use randomized benchmarks, where low-overhead psuedorandom number generation is done using Lehmer random number generator.
    • It is to be emphasized that LPython easily surpasses C++ and Python.

Learnings

I learnt a lot over the entire period. Most importantly, it was a great experience to be part of an open-source project of this scale! I had recently taken a Compilers course in university, and got to closely see the innards of one here. Moreover, I got introduced to LLVM, and implemented data-structures and algorithms for an actual compiler.

Thank You!

I would like to thank Ondřej Čertík for the conception of this wonderful compiler. My mentors Gagandeep Singh and Thirumalai Shaktivel were immensely helpful before and during this entire project. Thank you! Finally, thanks to all LPython contributors on whose work I could build on.