"I had my first build in a few hours, so why not share it?"
Steam Spy creator Sergey Galyonkin talks data, disclaimers and the margin of error
If you need any more evidence of just how crucial Steam's share of the market is to PC gaming as a whole, you don't need to look much further than the excitement which surrounded the emergence of Steam Spy over the Easter weekend. Developed in his spare time by a senior analyst at Wargaming, Steam Spy uses the publicly available Steam API to pull together a wealth of data on the ownership, geographical spread, play time and current engagement of every single one of the fully released games on Valve's service. It's resulted in reams and reams of glorious data, a wealth of figures and graphs and percentages which shines a light on one of the industry's most opaque marketplaces.
After 12 years of guesswork, rough estimates and increasingly irrelevant retail charts, journalists, gamers and developers alike pored over the lush, fresh figures, rapidly turning napkin maths into revenue figures for their favourite (or least favourite) games. A new dawn of insight had arrived, with Steam's internal workings blown wide open, and the story was that the actual weight of Gabe's jar of jellybeans could finally be pinpointed instead of just guessed at. But some of that early excitement was misplaced.
"It's not Steam sales," Steam Spy creator Sergey Galyonkin tells me over Skype, with a little resignation. "If someone buys a game from somewhere other than Steam but registers it there, it's going to be here. I wrote an 'About' page exactly for this sort of stuff, but...people don't read manuals.
"'Owned' means 'owned'. It includes games bought on Steam, bought in retail and then activated on Steam, bought in bundles, received through promotions or as a gift and so on. It also includes copies temporarily given away on 'free weekends', so take care when coming to conclusions about sales or revenue."
from Steam Spy's 'About' page
"It wasn't designed for a general audience, I've seen a lot of confusion already because a lot of people are assuming that owners data is equivalent to sales data - so they're multiplying owners by price to get hundreds of millions of dollars of sales for games. They're coming to really wrong conclusions. It was designed primarily for developers and journalists, mostly.
"I expected to get some exposure, but not for it to be so big. I thought people would like it, but not this much! I thought it would be people coming to check on their favourite game, then only coming back when a huge hit launches.
"The people trying to work out how much money a developer has made don't know how these things work - if you have had a game released then you know that the majority of sales come from discounts, not from full launch price. They don't know that selling seven million copies doesn't mean seven million copies at $60. It's maybe half a million at $60 and the rest at discount." A small sigh, but I feel like Galyonkin has largely worked through any feelings of responsibility for the over-eager analysis of others.
The past few days have certainly seen enough misinterpretation, though, with plenty of people using Sergey's figures to estimate the worth of friends, enemies and former employers. It was enough for some, including Gemini Rue developer Wadjet Eye, to warn publicly against trying to extrapolate earnings from the figures. Some other developers have shared their concerns, but generally the feeling amongst those we polled was that the figures themselves are pretty accurate, with most concerns being over the sort of conclusion-jumping which Galyonkin covers in his disclaimers.
"Adam Myers, our analyst, reckons that multiplying the number of Steam reviews by 100 is a remarkably effective rule of thumb for sales."
Failbetter's Alexis Kennedy
"It's giving an decent, though lowball estimate of Sunless Sea sales - we're within the range, but only just," Failbetter's Alexis Kennedy tells me. "But it doesn't indicate anything about the number of units we sold with a 10 per cent launch discount - almost half our sales - and as others have pointed out, the distortion becomes stronger after a game has been through a few promotions with deep discounts. I would be very interested to see how well it worked if it tracked promotions and applied even crude heuristics to their effects."
"The data on Steamspy seems pretty accurate in telling you how many owners there are of a game," added Positech's Cliff Harris. "It is within 5 per cent of the real figure on Democracy 3, for example. Obviously you have no idea how many of those copies are bought on discount, although I think the assumption that the majority of a devs steam income comes from sales is overplayed. Maybe I discount less than others, but it looks to me like you can basically half the income from a game to get a roughly true figure. That might be very different for more competitive genres though, I'm extrapolating from one game there."
TinyBuild's Mike Rose had similar concerns about the potential mixture of bad maths and enthusiasm which accompanies almost any public rush of data, but again felt that the figures as present by Steam Spy itself were pretty solid.
"It seems like the actual numbers are accurate, give or take, in terms of number of activated keys on Steam. The problem, of course, is that 'activated keys' does not correlate with sales and/or revenue at all, and these figures will no doubt be taken the wrong way by a plethora of people, including players who want to big up the games they like, developers who want to pull together a general idea of how much they will sell on Steam, and investors who are looking for the next big genre to plough money into.
"However, the database gives no indication of whether games have been on sale for cut-prices, or whether they've been in bundles, or whether they've been heavily promoted within Steam/outside of Steam... and that's without getting into the key reselling websites that are so prominent now. So it's all pretty interesting data, and it's fun to see roughly how many copies of game X have been downloaded, but I do hope it doesn't mislead a whole bunch of people."
Valve itself hasn't commented on the veracity of Steam Spy's data, and offered no response to our enquiries. However, given the company's track record on communications, it's likely that the lack of condemnation is a tacit acceptance, if not an endorsement. There's also the fact that many companies, including some of the big publishers we spoke to, aren't willing to publish these sorts of sales figures at all, and it's in Steam's interest to accommodate those wishes - there's nothing to be achieved by spilling your customers' private data.
But Steam Spy isn't the work of a hacker. Everything Galyonkin is using to power it is publicly available, it's just not easy to gather and interpret.
"As far as I'm concerned I'm not breaking any Steam guidelines because I'm not collecting any personal data...it's completely anonymous, I'm not storing any personal data, anywhere."
"I'm using the developer's key to the Steam API," he tells me. "As far as I'm concerned I'm not breaking any Steam guidelines because I'm not collecting any personal data. What I'm doing is collecting your ID, which I instantly convert to my personal ID, so I don't store your ID at all, then I check for games assigned to your account. So it's completely anonymous, I'm not storing any personal data, anywhere."
But if this data has been available for so long, and was so sought after, why has it taken so long for someone to refine it?
"It certainly wasn't because it took a long time to develop," laughs Sergey. "Kyle Orland (of Ars Technica) came up with the idea of Steam Gauge and several people immediately jumped into developing systems."
In addition, he continues, there were plenty of people who'd already cracked the safe, they just didn't want to share what was inside. He offers a few examples. "The research company, EEDAR, has a similar system internally, which does exactly the same thing that Steam Spy does, but they just don't publish the data. Then there's DeepGabe which had been developed for a few months but only went public after Steam Spy turned up, they'd been gathering the data but hadn't released it. There's several others doing the same job, but not publishing the data. Everyone was sitting on what they'd gathered but not sharing it. That's a problem I think.
"It's not hard to develop. It took me a week or so, and I was still doing my day job. It's about being open about this sort of stuff. Steam Charts is awesome too, actually. It collects peak concurrent user (PCU) data, which is super-important for free-to-play games. It's not much use to know that DOTA 2 has 50 million owners, because it's free, but when you see it has 1.2 million concurrent players or so, that's important. For paid games, it's the opposite - you need to know about owners, PCU doesn't give you much information."
Galyonkin isn't shy about the margin of error involved in the data. It's front and centre on every statistic and is covered again in his disclaimers, but he says he can and will make the service more accurate, given time and the right resources.
"Because the formula for the margin of error calculation includes a square root, I have to quadruple the amount of data gathered to double the accuracy"
"I'm ramping up the processing power - I've just moved the service to a new server with SSD and stuff like that. I'm going to try an increase gathering speed by 30-40 per cent over the next few days and I'm also optimising algorithms. The problem is that some of these improvements have diminishing returns."
Here, I sense Sergey weighing up an estimate of my statistical, mathematical and programming wherewithal, based on the brief chat we've had so far, judging how technical he should make his explanation. Thankfully, he's right on the money with 'not very'.
"Because the formula for the margin of error calculation includes a square root, I have to quadruple the amount of data gathered to double the accuracy," he explains, patiently. "So it's not going to take you very far with raw power. So what I'm thinking about now is maybe some machine learning algorithms, something like that. I'm not great at machine learning, even though it was the subject of my thesis, that was a long time ago. So I'm thinking about maybe opening an API for the data so people can work with it. So somebody smarter than me can take it further.
"I've seen some people doing interesting things with the data already, and they don't have many tools to work with. They just check the gamepage and write down the numbers. People are doing some interesting lists on NeoGaf, on Reddit, coming to interesting conclusions about genres, geography, obscure markets. Japan for example - everyone thinks it's a small market for PC games, but it turns out it's not so small if you actually do something for the market, with decent localisation. Same with Russia. I knew it was big for Steam, but not so big. Every single game has Russia in the top 3 markets, paid ones too.
"It works the other way, as well. I'm Ukrainian and DOTA 2 is like a religion here. But you look at the US and it's not so big. That's really surprising. Nobody is surprised that Football Manager is mostly UK, but then you see something like Mortal Kombat - it's huge in Russia. I guess they play it on PC rather than console, but it's the number one market for Mortal Kombat Complete edition."
"Really this was a toy for me to play with. I just wanted to have some data to look at"
Sergey obviously has a pretty advanced understanding of analysis, but so far he's steered clear of interpreting his own data, presenting only the raw numbers. Given his day job, surely he must have some plans to offer insights into what the treasure trove he's uncovered means?
"I'm thinking about it," he says, with a verbal shrug which conveys considerable humility. "I'm not sure what people would to see. I have some requests from Universities and other developers, so I can figure out what they would like to see from the site. Right not what I'm going to do is special pages or data enquiries for them to show the results they need. Some people need to research emotional games, some people want to see the correlation between playtime in games of the same genre, or the relation between one game and another. If you play CoD, do you play CS:GO too?
"So I'm taking advice, they're in research, they know more about it than I do. If something good comes up, we can put it on the site for other people to see as well. When you see these big companies selling reports, like Superdata, EEDAR, NPD, you have to have a huge sales team to do that, you can't just sell it as one guy. I don't have enough data to become a full-blown business. It's not a start-up industry, Steam is kind of an established market in a small range. In not like iOS where you can have two companies like AppAnnie and Fiksu competing to sell you data, where people are desperate for some idea of the market. On PC everyone has some idea of the market. I don't think that would work. Really this was a toy for me to play with. I wanted to have some data to look at. I wrote to Kyle, he was busy, so I just thought 'I'll do it myself. It's not that hard.'
"I had my first build in a few hours, so why not share it?"