Exploring Data Engineering with Rust

Learning Data Engineering Through Real Solana Projects


I help secure users' funds on DeFi protocols by providing audits and educating users on how to prevent hacks on their end.

I'm diving into the world of big data using Rust. This might sound unusual—isn't Python the industry standard? Yes, Python dominates the field, but I chose Rust because I want to master it for systems programming, which is my end goal.

It all started when I reached out to 0xIchigo about getting a job as a Rust developer. He told me that writing standalone programs isn't in high demand right now; the opportunities are in Rust combined with backend work and ETL pipelines. I started with some tutorials but got bored and lost the energy to keep practicing. That's when I switched to real projects to better understand how this works, because I learn best through hands-on work.

How Is It Going?

Progress has been challenging because many data packages aren't readily available for Rust. For now, I'm building data pipelines on Solana, which are primarily ETL (Extract, Transform, Load) pipelines. I haven't delved into data warehousing or data lakes yet. When I checked job descriptions for data engineering roles, I noticed they're split between GCP and Azure, and the packages for both are mainly in Python.

I'm planning to explore many Rust data crates, but for now Polars has caught my eye. It's marketed as a DataFrame library written in Rust that is faster than PySpark and Pandas, with claimed performance gains of up to 50x, and it can be used from Python, Rust, and TypeScript applications.

What Have I Done?

For starters, I've worked on a Solana token ETL and a Juplend ETL (Juplend is a Solana protocol for lending and borrowing). I used Solana gRPC and RPC as data sources and ClickHouse for storage (it's the standard among crypto developers, though I used PostgreSQL at first and had to build a migrator to ClickHouse).

I built an indexer. In my opinion, indexing is data engineering: you extract data from the blockchain (which isn't optimized for reading), transform the raw bytes into something readable based on the data you need, and then load it into a database. That is the ETL pattern.
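To make the extract, transform, load steps concrete, here is a minimal sketch in Rust using only the standard library. It is not my actual indexer: the 12-byte layout, the `TokenBalance` struct, and the in-memory `HashMap` standing in for a database table are all made up for illustration. A real indexer would decode the program's actual account or instruction layout and write to ClickHouse or PostgreSQL.

```rust
use std::collections::HashMap;

/// Hypothetical decoded record; a real indexer mirrors the on-chain layout.
#[derive(Debug, Clone, PartialEq)]
struct TokenBalance {
    owner_id: u32,
    amount: u64,
}

/// Builds the made-up 12-byte layout: u32 owner id + u64 amount, little-endian.
fn encode(owner_id: u32, amount: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12);
    buf.extend_from_slice(&owner_id.to_le_bytes());
    buf.extend_from_slice(&amount.to_le_bytes());
    buf
}

/// Extract: stand-in for raw bytes pulled from an RPC or gRPC source.
fn extract() -> Vec<Vec<u8>> {
    vec![
        encode(1, 500),
        encode(2, 1_250),
        vec![0xde, 0xad], // malformed payload, too short to decode
    ]
}

/// Transform: decode raw bytes into a typed record, or None if the length is wrong.
fn transform(raw: &[u8]) -> Option<TokenBalance> {
    if raw.len() != 12 {
        return None;
    }
    let mut id_bytes = [0u8; 4];
    id_bytes.copy_from_slice(&raw[0..4]);
    let mut amount_bytes = [0u8; 8];
    amount_bytes.copy_from_slice(&raw[4..12]);
    Some(TokenBalance {
        owner_id: u32::from_le_bytes(id_bytes),
        amount: u64::from_le_bytes(amount_bytes),
    })
}

/// Load: upsert into an in-memory table keyed by owner id.
fn load(table: &mut HashMap<u32, TokenBalance>, row: TokenBalance) {
    table.insert(row.owner_id, row);
}

/// Runs all three stages and reports how many payloads were skipped.
fn run_pipeline() -> (HashMap<u32, TokenBalance>, usize) {
    let mut table = HashMap::new();
    let mut skipped = 0;
    for raw in extract() {
        match transform(&raw) {
            Some(row) => load(&mut table, row),
            None => skipped += 1,
        }
    }
    (table, skipped)
}

fn main() {
    let (table, skipped) = run_pipeline();
    println!("loaded {} rows, skipped {} malformed payloads", table.len(), skipped);
}
```

Counting skipped payloads instead of silently dropping them is a deliberate choice in this sketch: it gives you a number to watch when the source layout changes.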

How Does The Solana Token ETL Work?

The ETL works like any other indexer: it extracts data related to tokens on Solana. In my case, I started with PumpFun and PumpSwap, the most used token launchpad and swap platforms, where I focused on trades (buys and sells), token creation, and migrations. I also had to work on backfilling, since I was using gRPC to listen to the latest transactions, which doesn't give me the history.
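One problem that comes with backfilling is overlap: the backfilled history and the live stream can both contain the same transaction. The sketch below shows one possible way to merge the two, assuming each event can be identified by its transaction signature plus an index within that transaction. The `TokenEvent` struct, the `EventKind` enum, and `merge_backfill` are hypothetical names for this example, not the types from my project.

```rust
use std::collections::HashSet;

/// The event categories described above (illustrative, not a full schema).
#[derive(Debug, Clone, PartialEq)]
enum EventKind {
    Buy,
    Sell,
    TokenCreated,
    Migration,
}

/// Hypothetical event row; one transaction can emit several events.
#[derive(Debug, Clone, PartialEq)]
struct TokenEvent {
    slot: u64,
    signature: String,
    event_index: u32,
    kind: EventKind,
}

/// Small constructor to keep the examples short.
fn event(slot: u64, signature: &str, event_index: u32, kind: EventKind) -> TokenEvent {
    TokenEvent {
        slot,
        signature: signature.to_string(),
        event_index,
        kind,
    }
}

/// Merges backfilled history with live events, dropping duplicates by
/// (signature, event_index) and ordering the result by slot.
fn merge_backfill(live: Vec<TokenEvent>, backfilled: Vec<TokenEvent>) -> Vec<TokenEvent> {
    let mut seen: HashSet<(String, u32)> = HashSet::new();
    let mut merged: Vec<TokenEvent> = Vec::new();
    for ev in backfilled.into_iter().chain(live) {
        // `insert` returns false when the key was already present.
        if seen.insert((ev.signature.clone(), ev.event_index)) {
            merged.push(ev);
        }
    }
    // Stable sort: events within the same slot keep their arrival order.
    merged.sort_by_key(|ev| ev.slot);
    merged
}

fn main() {
    let backfilled = vec![
        event(10, "sigA", 0, EventKind::TokenCreated),
        event(10, "sigA", 1, EventKind::Buy),
        event(11, "sigB", 0, EventKind::Sell),
    ];
    let live = vec![
        event(11, "sigB", 0, EventKind::Sell), // overlaps with the backfill
        event(12, "sigC", 0, EventKind::Migration),
    ];
    for ev in merge_backfill(live, backfilled) {
        println!("{:?}", ev);
    }
}
```

In a database-backed pipeline the same idea is usually expressed as a uniqueness key on the table rather than an in-memory set; the set just keeps the sketch self-contained.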

I struggled with the ClickHouse integration, especially the insert operations: it took significant learning to understand the RowWrite and RowRead row operations. PostgreSQL was easier, or I should say I nailed it the first time, whereas with ClickHouse I spent over 8 hours trying to get it working.

My plan for this project is to turn it into a real company that provides data APIs that people can use in their bots and applications.

Do I Think Data Engineering Is Hard?

Yes and no.

I can't say it's hard, maybe because I have some experience with data analysis, so I understand data to some extent. Where it becomes challenging is understanding the structure of your data source. I once had to purge 200,000 rows after I made a mistake and saved the wrong amount. This experience is specific to Solana: understanding the Solana account model makes it easier to get the data you want.
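As an illustration of how an amount can go wrong (this is an assumed example, not necessarily the mistake I made): SPL token amounts are stored on-chain as raw `u64` integers, and the mint's `decimals` field says where the decimal point goes. Saving the raw value where the display value is expected, or the reverse, produces rows that are off by a power of ten. The helper below, with the hypothetical name `ui_amount_string`, converts a raw amount to a decimal string using integer math only, so no precision is lost to floating point.

```rust
/// Converts a raw integer token amount into a decimal string using the
/// mint's `decimals`. Returns None if 10^decimals does not fit in a u128.
fn ui_amount_string(raw: u64, decimals: u8) -> Option<String> {
    if decimals == 0 {
        return Some(raw.to_string());
    }
    let scale = 10u128.checked_pow(u32::from(decimals))?;
    let raw = u128::from(raw);
    let whole = raw / scale;
    let frac = raw % scale;
    // Left-pad the fractional part with zeros to exactly `decimals` digits.
    Some(format!("{}.{:0width$}", whole, frac, width = usize::from(decimals)))
}

fn main() {
    // A raw amount of 1_500_000 on a 6-decimal mint is 1.5 tokens.
    println!("{:?}", ui_amount_string(1_500_000, 6));
    // The same raw amount on a 9-decimal mint is a very different number.
    println!("{:?}", ui_amount_string(1_500_000, 9));
}
```

Storing the raw integer together with the decimals, and converting only at read time, is one way to keep a mistake like this from requiring a purge.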

I haven't explored data frames or data lakes, or tried data warehousing yet, but based on my experience, understanding the structure of the data source was the main challenge, apart from understanding SQL and the programming language you're using to build your ETL pipeline (which in my case is Rust).

In my next project, I'll be using Polars, a data frame library, to understand the correlation between Ethereum gas prices and Solana activity. With that, I'll be able to discuss data frames in depth.

Also, with data engineering, you need to own your pipelines. This means you should be able to identify what's wrong with the data and pinpoint exactly where the fault is coming from. You also need to understand the data you're gathering and what it means, as this is what you'll be communicating to stakeholders.

Follow My Journey

I'll be documenting my entire data engineering journey with Rust—the wins, the struggles, and everything I learn along the way. If you're interested in following along, building similar projects, or just want to connect, you can find me on X (Twitter) and LinkedIn. I share updates on what I'm working on, the challenges I face, and solutions I discover. Let's learn and grow together in this space.