
Master's Thesis Alexander Hefele

Last modified Feb 25

A Conceptual Model for Ethereum Blockchain Analytics


The hype around Ethereum has decreased drastically over the past year: falling exchange rates, stagnating transaction counts, and stabilizing transaction fees are just three indicators of this development. Nevertheless, Ethereum still plays a very important role in the cryptocurrency ecosystem: it is the second-largest cryptocurrency by market capitalization and remains a very popular platform for initial coin offerings (ICOs).


In this thesis, we develop a comprehensive formal model that represents the Ethereum platform as UML class diagrams. Splitting the system into the four parts "Source", "EVM", "Storage", and "Ledger" lets us impose a clear structure on this complex environment. Together, these four parts aim to give a deep understanding of the contract programming language Solidity, the underlying Ethereum Virtual Machine, how each node in the network stores account and state information, and the contents of the blockchain itself.

Afterwards, we apply our knowledge of the system to explore what data can be extracted from the Ethereum platform and how this can be done efficiently. In the relational database that we build, we store the bytecode of, and additional information about, all user- and contract-created smart contracts from the first 6,900,000 blocks.

With this data, we perform several analyses to gain more insight into the system. The anomalies we investigate include front-running, self-destructing constructors, and transactions to accounts that only become contracts after the transaction has been executed. Additionally, we cluster smart contracts by different criteria, such as who created them and whether they implement ERC token standards. Consulting metadata, such as references to hard-coded addresses in the bytecode of contracts, the usage of certain function signature hashes, and the balances of the contracts that a contract created, further refines our understanding of the system.
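
To illustrate how function signature hashes and hard-coded addresses can be read from bytecode, the following is a minimal sketch and not the method used in the thesis. It assumes that selectors appear as 4-byte `PUSH4` constants and addresses as 20-byte `PUSH20` constants, which is common in compiled Solidity code but not guaranteed. The function names `pushed_constants` and `looks_like_erc20` are hypothetical, and only three well-known ERC-20 selectors are checked.

```python
# Well-known ERC-20 function selectors: the first 4 bytes of the
# Keccak-256 hash of the function signature, written as hex.
ERC20_SELECTORS = {
    "totalSupply()": "18160ddd",
    "balanceOf(address)": "70a08231",
    "transfer(address,uint256)": "a9059cbb",
}


def pushed_constants(bytecode_hex, width):
    """Yield each constant pushed by a PUSH<width> instruction, as hex.

    EVM opcodes 0x60..0x7f are PUSH1..PUSH32; the pushed bytes follow
    the opcode and must be skipped so they are not decoded as opcodes.
    """
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    i = 0
    while i < len(code):
        op = code[i]
        if 0x60 <= op <= 0x7F:
            n = op - 0x5F  # number of immediate data bytes
            data = code[i + 1 : i + 1 + n]
            if n == width and len(data) == width:
                yield data.hex()
            i += 1 + n
        else:
            i += 1


def looks_like_erc20(bytecode_hex):
    """Heuristic (assumption): all three selectors occur as PUSH4 constants."""
    found = set(pushed_constants(bytecode_hex, 4))
    return all(sel in found for sel in ERC20_SELECTORS.values())
```

Calling `pushed_constants(code, 20)` with the same walker lists candidate hard-coded addresses. A scan like this can misread non-code data appended to a contract, so it should be treated as a heuristic only.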

The main contribution of this work is the estimation of compiler and Solidity library versions of arbitrary smart contracts. With two heuristics based on the contract creation date and the bytecode header, we set a range of minimum and maximum compiler versions for every contract code. We discover usage of the most popular Solidity library "SafeMath" by compiling every version of the library with every compatible compiler version, extracting its internal functions, and comparing the resulting bytecodes with all contract codes deployed on the blockchain. That also helps us improve the compiler version estimation.
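
As a rough illustration of combining two heuristics into one version range, the sketch below intersects two inclusive ranges. It is an assumption-laden example rather than the thesis's implementation: compiler versions are represented as positions in a chronologically ordered release list, and the names `intersect_ranges` and `range_size` are hypothetical.

```python
def intersect_ranges(a, b):
    """Intersect two inclusive (min_index, max_index) version ranges.

    Each index is a position in an ordered list of compiler releases
    (an assumption made for this sketch). Returns None if the two
    heuristics contradict each other.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def range_size(r):
    """Number of candidate versions in an inclusive range."""
    return r[1] - r[0] + 1
```

For example, if one heuristic allows releases 0 to 5 and the other allows releases 3 to 8, `intersect_ranges((0, 5), (3, 8))` yields `(3, 5)`, a range of size 3. Additional evidence, such as a detected library version, could be folded in as a further range in the same way.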

We evaluate our version estimations with verified contracts from the block explorer website Etherscan. For our compiler version estimation, the range we set is correct for 99% of the evaluated contract codes. The median size of the estimated compiler version range is 3. For SafeMath usage detection, we have a success rate of 82% with a median distance of 4. Despite considering 31 SafeMath versions, the highest library distance our approach sets for a contract code is only 14.

Research Questions

  1. How are the different parts of the Ethereum system correlated with each other?
  2. What data can be extracted from the blockchain for analysis and how can this be done efficiently?
  3. What does metadata tell us about the network?
  4. What are different areas of application of the Ethereum blockchain?
  5. What anomalies can be observed in the network?

