GSoC 2023 Report
Published:
For GSoC 2023, I worked with LPython - a high performance typed Python compiler. In particular, I worked on implementing and improving advanced data structures. My mentors were Gagandeep Singh and Thirumalai Shaktivel.
I worked on several data structures, improving existing implementations, and introducing others. Weekly blogs can be found here. In this report, major contributions and learnings are summarized.
List and Tuple
The list
implementation was lacking in many standard and commonly-used CPython functions, which were added. These include
index
count
pop
reverse
Some of these are overloaded, and implementations were added to make them CPython-compatible. Examples include pop
and index
.
Next, support for arbitrary amount of nesting involving list
and tuple
was added. An example:
l: list[list[tuple(i32, f64)]] = [[(1, 2.0), (3, 4.0)], [(5, 6.0)]]
Further, tuple and list comparison operators were added. For example,
assert [1, 2] > [0, 1, 2]
assert (1, 2) < (3, 1)
Wherever possible, the IntrinsicFunction
architecture was used. This prevents unnecessary nodes of the form ListFunction
in the ASR. We just register these functions like list.function
under the umbrella of IntrinsicFunction
.
Dictionary
dict
(hash-map) has Linear-Probing and Separate-Chaining implementations
- Linear-Probing is an open-addressing collision-resolution technique. All key-value pairs are stored in a single place, and key-collisions are resolved by probing for the next empty location. It is ensured that the fractional-occupancy does not cross a predetermined threshold. This is also used in the CPython implementation.
- Separate-Chaining uses separate linked-lists for each hash-value, where resolving collisions is simply adding elements to the corresponding linked-list. As with Linear-Probing, it is ensured that the fractional-occupancy does not exceed a threshold. This is used in the C++ STL implementation of
unordered_map
.
Existing implementations had issues in insertion and deletion of elements. These were fixed.
Next, frequently-used functions dict.keys
and dict.values
were added. Unlike CPython, LPython does not support views, so lists are returned instead.
Further, support for nested dictionaries was added. These are of the form
d: dict[i32, dict[i32, i32]] = {1: {2: 4, 3: 9}, 7: {8: 64, 9: 81}}
d[3] = {4: 16, 5: 25}
d[1][10] = 100
Set
We introduced sets (hash-sets).
Similar to dictionaries, we provide two collision-resolution schemes - Linear-Probing and Separate-Chaining. An overview of these techniques was given in the previous section. Now we present some more implementation-specific details:
Hash computation: Similar to both CPython (
set
) and C++ (unordered_set
), for integers it is the integer itself (moduloset
capacity). For strings, we use Polynomial Rolling Hash.- Linear Probing
- A single list of elements is maintained. Further, there is a list storing masks, that indicates if an element is present or not. The following mask values are used
- 0: no element placed
- 1: element placed without probing (i.e., hash value and position are same)
- 2: element placed with probing
- 3: tombstone (i.e., element was deleted, and now this position is empty)
- Insertion and deletion routines are implemented as per the standard linear probing scheme
- A single list of elements is maintained. Further, there is a list storing masks, that indicates if an element is present or not. The following mask values are used
- Separate Chaining
- Here, a chain (linked-list) of elements is maintained at each hash location. For example, to access the
i
th linked-list - This linked-list has items of the form
element, next
, wherenext
is a pointer to the next element in this list (ornullptr
if none exists) - Set insertion and deletion occur via these lists
- Here, a chain (linked-list) of elements is maintained at each hash location. For example, to access the
- Benchmarking
- Gist: Here we use randomized benchmarks, where low-overhead psuedorandom number generation is done using Lehmer random number generator.
- It is to be emphasized that LPython easily surpasses C++ and Python.
Learnings
I learnt a lot over the entire period. Most importantly, it was a great experience to be part of an open-source project of this scale! I had recently taken a Compilers course in university, and got to closely see the innards of one here. Moreover, I got introduced to LLVM, and implemented data-structures and algorithms for an actual compiler.
Thank You!
I would like to thank Ondřej Čertík for the conception of this wonderful compiler. My mentors Gagandeep Singh and Thirumalai Shaktivel were immensely helpful before and during this entire project. Thank you! Finally, thanks to all LPython contributors on whose work I could build on.