The Open-Source, Rust-Based ETL Framework for Blockchain Data on BigQuery by BCW Group

As a contributor to the open-source modular Rust ETL framework used in Google’s BigQuery PDP as a Web3 standard, BCW Technologies (BCWT) has invaluable experience in building blockchain ETL structures and public data sets.

Blockchain Extract-Transform-Load (ETL) is a process that consolidates data from nodes into a single, consistent data pipeline that can be configured for a wide range of use cases. Within the context of BCWT’s work on Google’s BigQuery PDP, the Blockchain ETL program is committed to developing data retrieval modules via an open-source framework. The goal is to make a select blockchain network’s data accessible to the public through downstream data services that facilitate access for research and testing purposes.

Blockchain networks generate a tremendous amount of data both inside and outside of the Ethereum Virtual Machine (EVM) sphere of influence. For example, the total public Solana dataset currently contains nearly a petabyte of historical data. Given how quickly Solana’s data set is growing, in both volume and velocity, most firms would find it difficult to manage both historical and real-time data streams efficiently.

With such a high volume of data being generated by networks, harnessing that data into a format that any user, including enterprises, can digest and utilize for their mission-critical application developments presents a series of challenges. As will be shown, BCWT has overcome the technological challenges of revamping data streams with ETL Programs through experience and an innovative approach to data management.

BCWT ETL Program

With extensive experience in building EVM and non-EVM network ETLs, BCWT stands apart from many others in the field. One of the ways in which BCWT differs is by delivering individual components that follow the ETL paradigm and by producing the overall framework for ingesting, filtering, and delivering network data to any data destination. Furthermore, as a Google Cloud Premier Service Partner, we work with blockchain network foundations and Google Cloud to build indexers and data insert routines for use with BigQuery in addition to the available Google blockchain public datasets.
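
To make the component idea concrete, here is a minimal sketch of how extract, transform, and load stages can be expressed as separate Rust traits and wired together by generic scaffolding. It is an illustration only: the trait names, `run_pipeline`, and the toy components (`HeightSource`, `EvenOnly`, `MemorySink`) are hypothetical and are not taken from the ETL-Rust repository.

```rust
// Hypothetical sketch of a modular ETL pipeline. None of these names come
// from the actual ETL-Rust repository.

/// Pulls raw items (for example, blocks) from a network.
trait Extract {
    type Raw;
    fn extract(&mut self) -> Option<Self::Raw>;
}

/// Turns one raw item into zero or more records. Returning an empty
/// vector filters the item out.
trait Transform<R> {
    type Record;
    fn transform(&self, raw: R) -> Vec<Self::Record>;
}

/// Delivers records to a destination.
trait Load<T> {
    fn load(&mut self, record: T);
}

/// Generic scaffolding: drains the extractor, transforms each item, and
/// loads every resulting record. Returns the number of records delivered.
fn run_pipeline<E, T, L>(extractor: &mut E, transformer: &T, loader: &mut L) -> usize
where
    E: Extract,
    T: Transform<E::Raw>,
    L: Load<<T as Transform<E::Raw>>::Record>,
{
    let mut delivered = 0;
    while let Some(raw) = extractor.extract() {
        for record in transformer.transform(raw) {
            loader.load(record);
            delivered += 1;
        }
    }
    delivered
}

/// Toy extractor: yields block heights from `next` through `end`, inclusive.
struct HeightSource {
    next: u64,
    end: u64,
}

impl Extract for HeightSource {
    type Raw = u64;
    fn extract(&mut self) -> Option<u64> {
        if self.next > self.end {
            return None;
        }
        let height = self.next;
        self.next += 1;
        Some(height)
    }
}

/// Toy transformer: keeps even heights only and formats them as strings.
struct EvenOnly;

impl Transform<u64> for EvenOnly {
    type Record = String;
    fn transform(&self, raw: u64) -> Vec<String> {
        if raw % 2 == 0 {
            vec![format!("block-{}", raw)]
        } else {
            Vec::new()
        }
    }
}

/// Toy loader: collects records in memory instead of writing to a warehouse.
struct MemorySink {
    rows: Vec<String>,
}

impl Load<String> for MemorySink {
    fn load(&mut self, record: String) {
        self.rows.push(record);
    }
}
```

Because each stage sits behind its own trait, swapping the in-memory sink for a real destination, or the toy source for a network client, would not require changes to `run_pipeline`.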

Our Extraction and Transformation logic is written in Rust, the same language used by major blockchain networks such as Solana and Near, and currently supports two networks: Solana and Aptos. Utilizing Rust ensures speed, resilience, and security. The logic is deployed in stateless Docker containers that run in Kubernetes and autoscale with network activity.

Rust powers Aptos. Rust powers Solana. Thus, tapping into these chains via Rust is a natural move for the industry. In practice, this means being able to directly import the official block structure from Solana to guarantee accurate API deserialization. Likewise, Aptos’ official code provides a function to call its gRPC API to request a range of transactions. These utilities ensure correct API usage, and the type-safety of Rust compounds this to help ensure an accurate data transformation, too.
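
The type-safety point can be illustrated with a small, self-contained sketch. The structs below are simplified, hypothetical stand-ins for a network's official block types (a real pipeline would import the upstream definitions rather than redefine them), and `block_to_rows` is an assumed name for a transformation step that flattens a block into table rows.

```rust
// Simplified, hypothetical stand-ins for a network's official block types.
// These are NOT the real Solana or Aptos definitions.

#[derive(Debug, Clone, PartialEq)]
struct Transaction {
    signature: String,
    fee: u64,
    succeeded: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct ConfirmedBlock {
    slot: u64,
    // Optional: in this sketch, a response may omit the transaction list.
    transactions: Option<Vec<Transaction>>,
}

/// One flat row, shaped for loading into a warehouse table.
#[derive(Debug, Clone, PartialEq)]
struct TransactionRow {
    slot: u64,
    index_in_block: usize,
    signature: String,
    fee: u64,
    succeeded: bool,
}

/// Flattens a block into one row per transaction. The compiler forces the
/// missing-transactions case to be handled: a block without a transaction
/// list simply yields no rows.
fn block_to_rows(block: &ConfirmedBlock) -> Vec<TransactionRow> {
    block
        .transactions
        .iter() // iterates over the Option: zero or one inner Vec
        .flatten() // then over the transactions inside that Vec
        .enumerate()
        .map(|(index, tx)| TransactionRow {
            slot: block.slot,
            index_in_block: index,
            signature: tx.signature.clone(),
            fee: tx.fee,
            succeeded: tx.succeeded,
        })
        .collect()
}
```

If a field were renamed or its type changed in the block definition, this transformation would stop compiling rather than silently producing wrong rows, which is the benefit described above.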

[Figure: How we import Aptos’ gRPC API-calling code]
[Figure: How we use Aptos’ gRPC API-calling code]
[Figure: How we use Solana’s official block structure (UiConfirmedBlock) to get the transaction data from a block requested with the API]
[Figure: A visualization of the ETL workflow from BCWT]

We utilize cloud infrastructure exclusively from Google Cloud. We use Pub/Sub as a scalable indexing queue, and Dataflow as a scalable data loader into BigQuery. Pub/Sub is built to improve scalability via load distribution, message buffering, and decoupling components in a system so they can communicate asynchronously and independently of each other. It is instrumental in enabling us to ingest analytic events for streaming into BigQuery and other destinations.
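
The decoupling that a message queue provides can be sketched in-process with Rust's standard library. The example below is only an analogy: it uses `std::sync::mpsc` as a stand-in for a queue and does not call the Pub/Sub API, and the function name `index_and_load` is hypothetical. The producer ("indexer") and consumer ("loader") share nothing but the channel, so each side runs at its own pace while the channel buffers messages in between.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs an "indexer" and a "loader" on separate threads, connected only by
/// a channel that stands in for a message queue. Returns the slots the
/// loader received, in arrival order.
fn index_and_load(slots: Vec<u64>) -> Vec<u64> {
    let (queue_tx, queue_rx) = mpsc::channel::<u64>();

    // Producer: publishes each slot number to the queue.
    let indexer = thread::spawn(move || {
        for slot in slots {
            // `send` fails only if the receiving side has gone away.
            if queue_tx.send(slot).is_err() {
                break;
            }
        }
        // `queue_tx` is dropped here, which closes the channel.
    });

    // Consumer: drains the queue until the channel is closed.
    let loader = thread::spawn(move || {
        let mut loaded = Vec::new();
        for slot in queue_rx {
            loaded.push(slot);
        }
        loaded
    });

    indexer.join().expect("indexer thread panicked");
    loader.join().expect("loader thread panicked")
}
```

With a single producer, the channel preserves send order, so calling `index_and_load(vec![1, 2, 3])` returns the same three slots in the same order. A managed queue additionally offers durability and fan-out, which this in-process sketch does not model.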

As a fully managed serverless data processing service, Dataflow provides unified stream data processing at scale, which, when used in conjunction with Pub/Sub, is perfectly suited for ETL workflows. While Dataflow enables batch processing, BCWT opts for its data streaming capabilities.

Clients and partners of BCWT continue to benefit from our high-performance approach to ETL that is able to deliver data in real-time. In one specific case, a blockchain network’s Mainnet data is generally available for querying in BigQuery within 5 seconds of blockchain network confirmation, depending on network activity.

Moreover, BCWT performs daily quality control checks with Google Cloud Composer (a managed instance of Apache Airflow). This approach has two key benefits. The first is redundancy, in the form of targeted backfills in case of failures anywhere in the data pipeline. The second is deduplication, which increases storage efficiency by eliminating superfluous duplicate data.
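
The logic behind such checks can be sketched in a few lines of Rust. This is a hedged illustration, not the Cloud Composer scripts themselves: the function names `dedup_heights` and `missing_ranges` are hypothetical, and the sketch assumes that each loaded record can be identified by a block height. Deduplication collapses repeated heights, and gap detection produces the ranges a targeted backfill would need to re-request.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Removes duplicate heights and returns the remainder in ascending order.
fn dedup_heights(heights: &[u64]) -> Vec<u64> {
    heights
        .iter()
        .copied()
        .collect::<BTreeSet<u64>>()
        .into_iter()
        .collect()
}

/// Given sorted, unique heights and the inclusive range that should have
/// been loaded, returns the missing sub-ranges (the backfill targets).
fn missing_ranges(loaded: &[u64], expected: RangeInclusive<u64>) -> Vec<RangeInclusive<u64>> {
    let (start, end) = (*expected.start(), *expected.end());
    let mut gaps = Vec::new();
    let mut next_expected = start;

    for &height in loaded {
        // Ignore heights outside the window being checked.
        if height < next_expected || height > end {
            continue;
        }
        if height > next_expected {
            gaps.push(next_expected..=height - 1);
        }
        match height.checked_add(1) {
            Some(next) => next_expected = next,
            // Reached u64::MAX, so nothing can be missing after it.
            None => return gaps,
        }
    }

    // Anything after the last loaded height is also missing.
    if next_expected <= end {
        gaps.push(next_expected..=end);
    }
    gaps
}
```

For example, if heights 10, 11, 14, 15, and 18 were loaded (some of them twice) and the expected window is 10 through 20, deduplication leaves five heights and gap detection reports 12 to 13, 16 to 17, and 19 to 20 as the ranges to backfill.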

The processes used by BCWT and the benefits to clients and partners can be seen in production through open-source code repositories found in Blockchain ETL’s GitHub organization. There, anyone can view, audit, propose changes to, and use the code under the permissive MIT license.

In the Blockchain ETL’s GitHub repositories, BCWT has contributed all of the code and infrastructure scripting necessary to reproduce these public Aptos and Solana datasets. These community contributions are complete with Dockerfiles and hosted Docker images for the indexer, as well as Terraform scripts to deploy the cloud services in any user’s Google Cloud project, including the blockchain node. In the case of Aptos, Cloud Composer scripts are also available for daily quality control checks. Finally, BCWT has contributed all of the code to the ETL-Rust repository, containing a framework for building another Rust-based ETL.

The infrastructure-as-code for deploying a pipeline (everything from the node to the dataset) is available for both Aptos and Solana within their respective GitHub repositories. To learn more about doing data analysis on Aptos, check out this article.

The quality of our service is exemplified by references, which can be produced upon request. Contact us to learn more at https://bcw.group/contact/.
