Lucky you! You’re one of the first to be subjected to my first foray into an exploratory data analysis. Then again, maybe you’re just dying to let me hash out (*wink*) overly complicated niche-community topics into layman’s terms.

Ever since a very enlightening workshop on the wonders of the blockchain and bitcoin, I’ve felt drawn to uncover the basics (arcane mysteries to me, of course) in the second coming of Chri-whoops, cryptocurrencies. By that, I mean Ethereum. (Move aside, Bitcoin, you’re boring…)

Not quite. Ethereum is a blockchain-based computing platform, and its cryptocurrency, ether, is only one of its many functions. Proposed in 2013 by 19-year-old Vitalik Buterin of UWaterloo (shout-out to Canada), it has evolved drastically over the past 4 years, in part thanks to a notorious hack. One of ethereum’s selling features, the ability to create customizable smart contracts, is a big reason it has grown so popular for facilitating online gambling. One crowdfunded smart contract in particular, the DAO (Decentralized Autonomous Organization), backfired enormously on the whole Ethereum community. The DAO was valued at 250 million USD at its peak, but a security oversight let an attacker drain 3.6m ether (70 million USD) from the contract.

It was like “losing a p-hat in the wildy because you were a cocky l337 tank”. Ouch.

So we had two options: the safe space vs. the brave space. Supporters of a hard fork wished to return the stolen ether to its owners; a bailout. Others argued that the blockchain must be immutable, that code is law, and that rewriting history might render the future of Ethereum unstable. In the end, the chain was forked: the forked chain kept the Ethereum name, the original chain lived on as Ethereum Classic, and thus began a new saga.

I begin the EDA by analyzing the data to see if there are any underlying relationships between variables. The dataset, which I downloaded from Kaggle, unfortunately does not take the hard fork into account; it begins from the June 2017 spike (less excitement for us, eh?). The null hypothesis is that none of the variables has a significant linear relationship with the price of ethereum in USD, but I hope to show otherwise.
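A quick way to eyeball linear relationships before fitting anything is a correlation matrix. Here is a minimal sketch, assuming nothing about the real dataset: the data frame `sim` is simulated (made-up numbers, with column names borrowed from the Kaggle file only for readability), so the output says nothing about actual ether prices.

```r
# Simulated stand-in data (NOT the Kaggle dataset; all numbers are invented)
set.seed(3)
n <- 100
sim <- data.frame(transactions = runif(n, 1e3, 1e5))
sim$hashrate  <- 0.2 * sim$transactions + rnorm(n, sd = 2000)
sim$price_USD <- 1e-3 * sim$transactions + rnorm(n, sd = 10)

# Pairwise Pearson correlations between every pair of columns
cor_matrix <- cor(sim)
print(round(cor_matrix, 2))

# Test a single pair against the null hypothesis of zero correlation
ct <- cor.test(sim$price_USD, sim$transactions)
print(ct$p.value)
```

With the real data, the same two calls on `ether_data` (numeric columns only) would give a first hint of which variables move with the price. A strong correlation still says nothing about which one drives the other.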

Of course, it’s important to make a prediction. Based on these conditions, what does the future look like for ethereum?

ether_data <- read.csv("all_data.csv")
lm_ether1 <- lm(price_USD ~ total_eth_growth + hashrate + transactions + timestamp + total_addresses + blocksize, data = ether_data)
> summary(lm_ether1)
Call:
lm(formula = price_USD ~ total_eth_growth + hashrate + transactions +
timestamp + total_addresses + blocksize, data = ether_data)
Residuals:
    Min      1Q  Median      3Q     Max
-46.698  -2.001   0.225   2.364  61.068
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      -2.423e+04  1.036e+03 -23.388  < 2e-16 ***
total_eth_growth -6.866e-05  2.612e-06 -26.288  < 2e-16 ***
hashrate         -1.780e-03  2.363e-04  -7.531 1.62e-13 ***
transactions      5.264e-04  3.199e-05  16.457  < 2e-16 ***
timestamp         2.028e-05  8.505e-07  23.848  < 2e-16 ***
total_addresses   1.537e-04  5.359e-06  28.686  < 2e-16 ***
blocksize         1.589e-03  4.978e-04   3.193  0.00147 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.732 on 677 degrees of freedom
Multiple R-squared: 0.9773, Adjusted R-squared: 0.9771
F-statistic: 4858 on 6 and 677 DF, p-value: < 2.2e-16

It’s apparent that every one of these coefficients is statistically significant (“blocksize” a little less strongly than the rest, at p ≈ 0.0015), but I can’t think of a way to plot this model with every variable. Therefore, I plot each variable separately to start.
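One possible way around the plotting problem, sketched here under assumptions: instead of plotting price against every predictor at once, plot the model’s fitted values against the observed prices. The data below is simulated purely so the snippet runs on its own (the column names mirror the model above, the numbers are made up, and `sim_data` and `sim_fit` are hypothetical names); with the real dataset, `fitted(lm_ether1)` and `ether_data$price_USD` would take their place.

```r
# Simulated stand-in for ether_data (made-up numbers, NOT the Kaggle data)
set.seed(42)
n <- 200
sim_data <- data.frame(
  transactions = runif(n, 1e3, 1e5),
  hashrate     = runif(n, 1e2, 3e4)
)
sim_data$price_USD <- 2 + 1e-3 * sim_data$transactions +
  5e-3 * sim_data$hashrate + rnorm(n, sd = 5)

# Fit a multi-predictor linear model, in the same spirit as lm_ether1
sim_fit <- lm(price_USD ~ transactions + hashrate, data = sim_data)

# One plot for the whole model: fitted values vs. observed prices.
# Points hugging the red 1:1 line mean the model tracks the data well.
plot(fitted(sim_fit), sim_data$price_USD,
     xlab = "Fitted price (USD)", ylab = "Observed price (USD)")
abline(a = 0, b = 1, col = "red")
```

This collapses any number of predictors into a single 2-D picture, at the cost of hiding which predictor is responsible for a poor fit.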

plot(price_USD ~ timestamp, data = ether_data)
plot(price_USD ~ total_eth_growth, data = ether_data)
plot(price_USD ~ market.cap.value, data = ether_data)
plot(price_USD ~ transactions, data = ether_data)
plot(price_USD ~ hashrate, data = ether_data)

The output for price_USD as a function of total_eth_growth doesn’t look too promising for fitting a linear regression. Unfortunately, a similar pattern held for most of the other variables. However, the “price_USD ~ hashrate” and “price_USD ~ transactions” plots look like they might follow a second-order (quadratic) fit. If anything, the sub-par R² values of the straight-line fits suggest a better line of best fit is out there somewhere.

#linear model for price_USD ~ transactions
lm_ether5 <- lm(price_USD ~ transactions, data = ether_data)
summary(lm_ether5) #produces...
> Residuals:
    Min      1Q  Median      3Q     Max
-45.463 -12.551  -4.679  14.176 137.183
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.410e+01  1.070e+00  -22.52   <2e-16 ***
transactions  1.045e-03  1.858e-05   56.23   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.75 on 682 degrees of freedom
Multiple R-squared: 0.8226, Adjusted R-squared: 0.8223
F-statistic: 3162 on 1 and 682 DF, p-value: < 2.2e-16
-----------------------------------------------------------------
#linear model for price_USD ~ hashrate
lm_ether6 <- lm(price_USD ~ hashrate, data = ether_data)
summary(lm_ether6) #produces...
> Residuals:
    Min      1Q  Median      3Q     Max
-46.035 -11.141   5.673   8.583 192.344
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.715e+00  9.569e-01  -10.15   <2e-16 ***
hashrate     5.051e-03  9.671e-05   52.23   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.91 on 682 degrees of freedom
Multiple R-squared: 0.8, Adjusted R-squared: 0.7997
F-statistic: 2728 on 1 and 682 DF, p-value: < 2.2e-16
Left-image: A plot of price_USD ~ transactions. Right-image: A plot of price_USD ~ hashrate.

I wasn’t entirely sure what I was doing, but I decided to follow the scent.

lm_ether3 <- lm(price_USD ~ I(transactions**2), data = ether_data)
summary(lm_ether3)
> Residuals:
     Min       1Q   Median       3Q      Max
-106.782   -2.126   -0.653    1.291   69.382
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       1.225e+00  3.753e-01   3.263  0.00116 **
I(transactions^2) 5.829e-09  4.591e-11 126.955  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.97 on 682 degrees of freedom
Multiple R-squared: 0.9594, Adjusted R-squared: 0.9593
F-statistic: 1.612e+04 on 1 and 682 DF, p-value: < 2.2e-16
--------------------------------------------------------------
lm_ether4 <- lm(price_USD ~ I(hashrate**2), data = ether_data)
summary(lm_ether4)
>Residuals:
    Min      1Q  Median      3Q     Max
-41.481  -2.814  -1.113   4.465  86.253
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.632e+00  3.360e-01   10.81   <2e-16 ***
I(hashrate^2) 1.730e-07  1.240e-09  139.48   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.193 on 682 degrees of freedom
Multiple R-squared: 0.9661, Adjusted R-squared: 0.9661
F-statistic: 1.945e+04 on 1 and 682 DF, p-value: < 2.2e-16
---------------------------------------------------------------
#Looks like a WAY better fit. Let's plot the improved models!
plot(price_USD ~ I(transactions^2), data = ether_data)
#found coefficients for abline from summary(lm_ether3)
abline(a = 1.225, b = 5.829e-09, col = "red", lty = "solid")
plot(price_USD ~ I(hashrate^2), data = ether_data)
#found coefficients for abline from summary(lm_ether4)
abline(a = 3.632, b = 1.730e-07, col = "red", lty = "solid")
The first plot depicts ethereum’s price in USD as a function of the squared number of transactions/time. The second plot depicts the price in USD as a function of the squared hashrate/time. Both plots contain a line of best fit.

As you can see, there has been a massive improvement in R-squared values between the two different models for price_USD ~ transactions (from ~0.82 to ~0.96), which suggests the quadratic model is the better fit.
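The comparison above can be sketched in a few lines. This is an illustration on simulated data, not a rerun of the analysis: `sim`, `fit_linear` and `fit_quadratic` are hypothetical names, and the data is deliberately generated from a quadratic curve, so the result is baked in. AIC is included as a second yardstick alongside R-squared; it is an addition for illustration, not something used in the analysis above.

```r
# Simulated data that truly follows a quadratic trend plus noise
set.seed(1)
x <- seq(1, 100, length.out = 150)
sim <- data.frame(x = x, y = 1 + 0.02 * x^2 + rnorm(150, sd = 5))

# First-order fit vs. a fit on the squared predictor
fit_linear    <- lm(y ~ x, data = sim)
fit_quadratic <- lm(y ~ I(x^2), data = sim)

# Higher R-squared and lower AIC both point to the better-fitting model
r2 <- c(linear    = summary(fit_linear)$r.squared,
        quadratic = summary(fit_quadratic)$r.squared)
print(round(r2, 3))
print(AIC(fit_linear, fit_quadratic))
```

Both models have a single predictor here, so comparing R-squared directly is fair; with differing numbers of predictors, adjusted R-squared or AIC would be the safer comparison.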

Similarly, the R-squared value for price_USD ~ hashrate increases (from ~0.80 to ~0.97) when we move from the first-order to the second-order fit. Semi-success! The next step would be to introduce a new data set on which to test the model and make predictions. Due to time and energy constraints, I choose to let this slide.
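For anyone with more time and energy, here is a rough sketch of what that next step might look like, assuming a simple hold-out approach: fit on the earlier rows, predict the later ones, and measure the error. Everything here is simulated (the coefficients are only loosely inspired by the lm_ether4 summary, and `sim`, `train`, `test` and `n_train` are hypothetical names), so treat it as a template rather than a result.

```r
# Simulated stand-in data, sorted so that row order mimics time
set.seed(7)
n <- 300
sim <- data.frame(hashrate = sort(runif(n, 100, 30000)))
sim$price_USD <- 3.6 + 1.7e-7 * sim$hashrate^2 + rnorm(n, sd = 8)

# Hold out the last 20% of rows (the "future") instead of a random sample
n_train <- floor(0.8 * n)
train <- sim[1:n_train, ]
test  <- sim[(n_train + 1):n, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(price_USD ~ I(hashrate^2), data = train)
pred <- predict(fit, newdata = test, interval = "prediction")

# Root mean squared error on data the model has never seen
rmse <- sqrt(mean((test$price_USD - pred[, "fit"])^2))
print(head(pred))
print(rmse)
```

The `lwr` and `upr` columns returned by `predict()` give a prediction interval for each held-out point, which is a more honest answer to “what does the future look like?” than a single number.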

Do hashrate and the number of transactions/time have an influence on ethereum’s value in USD? Probably. Do any of the other variables have an effect as well? Yeah, maybe, if I could figure out how to fit a linear regression to those plots. I’m inclined to reject the null hypothesis, although I realize that this EDA had a lot of shortcomings. However, I’m going to choose to pat myself on the back for entering a jungle; dark but full of diamonds. I’ve a long way to go, but eventually I’ll learn how to better mine them.

Creds to Kaggle for the dataset, Ray Heberer for giving me just the right amount of hints, and Vitalik for inventing the topic.


