Difficulty vs XL

Friday, 28th June 2019, 06:43 by bel

How difficult is the game as we go along? What is the relationship between experience level and difficulty? This question has obvious implications for the XP curve in-game.

We can try to answer the question using this measure:

Difficulty at experience level XL = Probability of dying at experience level XL given that we have reached experience level XL.

The data required can be obtained using Sequell queries. I did it for version 0.22. Details are in the spoiler text. I only did it for lvls 1-25.
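As a minimal sketch of the measure defined above, assuming the Sequell output has already been reduced to two count tables (deaths at each XL, and games that reached each XL). The function name and the counts are made up for illustration and are not the 0.22 data:

```python
def difficulty_by_xl(deaths_at, reached):
    """Return {xl: P(die at xl | reached xl)} from two count tables."""
    result = {}
    for xl, n_reached in sorted(reached.items()):
        if n_reached == 0:
            continue  # nobody reached this level, so the ratio is undefined
        result[xl] = deaths_at.get(xl, 0) / n_reached
    return result


# Hypothetical counts, purely for illustration.
reached = {1: 1000, 2: 800, 3: 700}
deaths_at = {1: 200, 2: 100, 3: 70}
difficulty = difficulty_by_xl(deaths_at, reached)
```

With these made-up numbers the difficulty is 0.2 at XL 1, 0.125 at XL 2 and 0.1 at XL 3.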


Here's a graph of difficulty vs XL, for all players and for "tenpercenters".

Friday, 28th June 2019, 07:54 by chequers

That looks pretty great overall for tenpercenters. Maybe you could plot the first derivative?

Friday, 28th June 2019, 09:56 by bel

First derivative of difficulty (the data is underneath the spoiler tag):

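Since XL is discrete, one plausible reading of "first derivative" is the difference between consecutive levels. This is a sketch under that assumption, not necessarily how the plot was produced; the function name and sample values are hypothetical:

```python
def first_difference(difficulty):
    """Return {xl: difficulty[xl] - difficulty[xl - 1]} for consecutive levels."""
    levels = sorted(difficulty)
    return {
        cur: difficulty[cur] - difficulty[prev]
        for prev, cur in zip(levels, levels[1:])
        if cur - prev == 1  # skip gaps where a level is missing from the table
    }


# Hypothetical difficulty values, for illustration only.
example = {1: 0.20, 2: 0.125, 3: 0.10}
slope = first_difference(example)
```

Negative values mean the game is getting easier from one level to the next.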

Friday, 28th June 2019, 10:07 by bel

Another idea could be to calculate difficulty in game turns, instead of XL. Assuming a player plays at roughly the same rate throughout the game, we could then see how "smooth" the difficulty level they experience is as they play the game.

The relevant Sequell query would be "!lg * !boring 0.22 / turn<100", and so on. The same methodology used above would work, though the table would be larger. We probably could use smaller buckets at low turncounts and bigger buckets at higher turncounts. We can perhaps take out Chei worshippers and Frogs/Nagas/Spriggans/Centaurs/Felids, but I doubt that if we use game turns, it would make much of a difference either way.
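The bucketing idea can be sketched as follows. The thresholds are roughly log-spaced, so buckets are small at low turncounts and large at high ones. The query strings simply follow the pattern quoted above and have not been checked against Sequell. `bucket_difficulty` assumes we have the cumulative number of games that ended before each threshold, which is a simplification because it treats wins the same as deaths. All names and numbers are hypothetical:

```python
def turn_bucket_edges(max_turn, per_decade=4, start=100):
    """Return increasing integer turn thresholds, roughly log-spaced."""
    edges = []
    k = 0
    while True:
        edge = round(start * 10 ** (k / per_decade))
        if edge > max_turn:
            break
        if not edges or edge > edges[-1]:
            edges.append(edge)
        k += 1
    return edges


def bucket_queries(edges, version="0.22"):
    """Build one query string per threshold, copying the pattern in the post."""
    return [f"!lg * !boring {version} / turn<{edge}" for edge in edges]


def bucket_difficulty(ended_before, total_games):
    """ended_before[i] = games that ended before edge i (cumulative counts).

    Difficulty of a bucket = games ending inside it / games that entered it.
    """
    result = []
    previous = 0
    for cumulative in ended_before:
        at_risk = total_games - previous
        result.append((cumulative - previous) / at_risk if at_risk else 0.0)
        previous = cumulative
    return result


edges = turn_bucket_edges(1000, per_decade=2)
queries = bucket_queries(edges)
rates = bucket_difficulty([100, 250, 400], total_games=1000)  # made-up counts
```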

I'll get to it when I have some time.

Friday, 28th June 2019, 18:22 by bel

I created plots for difficulty level based on turns. The turns are on a log scale.

Data under the spoiler tag.


Friday, 28th June 2019, 23:25 by Siegurt

Labels on the horizontal axis would be nice.

Saturday, 29th June 2019, 02:08 by bel

There are labels on the horizontal axes -- you just have to scroll a little bit since the image is a little too big.

Saturday, 29th June 2019, 03:47 by Siegurt

bel wrote:There are labels on the horizontal axes -- you just have to scroll a little bit since the image is a little too big.

Oh, hehe

Saturday, 29th June 2019, 10:03 by bel

From the graph above, we can see that there are roughly two parts of the graph: from turn 1-1000, and from turn 1000 onwards.

Here are zoomed-in graphs for the two periods, with significant events labeled. (This is for all players; the tenpercenters graph is similar). Keep in mind that these are all averages.

Monday, 1st July 2019, 15:45 by bel

Some more plots, this time of the difficulty of each floor. Dungeon only goes till D:8 because otherwise the calculations for randomized Lair entrances become too hairy.


Comments:

D:1-4 are fairly tough. There's a sharp drop in difficulty from D:4 to D:5 and the difficulty keeps trending downward.
While all players experience a sharp drop in difficulty when they enter Lair, the effect is much sharper for Tenpercenters. In other words, the meme "the game is won by Lair" holds much more strongly for the latter group.

Monday, 1st July 2019, 17:44 by Siegurt

bel wrote:While all players experience a sharp drop in difficulty when they enter Lair, the effect is much sharper for Tenpercenters. In other words, the meme "the game is won by Lair" holds much more strongly for the latter group.

Note that since your graphs don't go past the lair, it's hard to draw that conclusion (for all we know from the charts, the difficulty past the lair goes back up again).

Monday, 1st July 2019, 17:54 by bel

I didn't calculate it, but I highly doubt that the game becomes much tougher after Lair, given the shape of the other graphs. All indications are that the game keeps becoming easier.

Monday, 1st July 2019, 23:23 by Siegurt

bel wrote:I didn't calculate it, but I highly doubt that the game becomes much tougher after Lair, given the shape of the other graphs. All indications are that the game keeps becoming easier.

I don't personally suspect it does either, but the data doesn't support the conclusion one way or the other, so your statement about the post-lair game doesn't really relate to the graph in question (not to say it's an incorrect statement, just that it's not supported by the graph)

Wednesday, 3rd July 2019, 05:46 by bel

[This discussion happened in another thread, but I think it belongs better here.]

tealizard dissents from this entire approach of using Sequell queries to determine game difficulty:

tealizard wrote:Incidentally, the plots from the other thread are essentially a souped up version of the Sequell query arguments that have such a dubious history here. Unfortunately, the data that goes into these arguments is either too sparse to be useful (when it's based on only players who do something that reasonably approximates optimal play) or just straight up garbage (includes games from players who win less than 80% of the time). Fortunately, there are good arguments from first principles that show how crawl difficulty works, namely starts pretty high as computer games go, then craters after about 3 floors.

tealizard wrote:Though bel's methods of arriving at conclusions about dcss's difficulty curve are faulty, his conclusions are in broad agreement with reasoning from the rules of the game -- i.e. the correct way to generate knowledge about crawl. This idea of using sequell data is not new and people have gone over and over this kind of thing for years, working through all these sorts of exclusions and special conditions.

Similarly, bel's theory that you should reason from the practice of players rather than the rules of the game (including optimal play analysis) is faulty. This is a recipe for a game full of cheez, where the designers pretend they don't know the rules until sufficiently many players figure them out. You guys are just relitigating settled questions, either in the mistaken belief of having found something new or in order to forestall impetus for change.

So, here's my response:

First, I see no articulation of what these "first principles" are, nor any alternate measure of difficulty curve which is derived from these "first principles" or "rules of the game". Crawl is a very complicated game with a hundred different mechanics. How are you going to handle all the mechanics starting from "first principles"? I don't like handwaving.

Second, the problem with using "optimal play" is that Crawl isn't very hard if you play optimally. The "optimal winrate" quickly reaches 100%, and as a corollary, the difficulty level quickly reaches 0. So, crawl is a combination of "intractable" and "uninteresting" if you try to analyze from "first principles" (whatever that means exactly).

Third, there is the idea that looking at how people actually play the game is "straight up garbage". I am trying to look at the margin of existing Crawl development, not trying to write my own fork. Why does every Crawl release come with an associated tournament? It's partly for ironing out bugs, but it is also to get a sense of game balance. To give an explicit example of the latter, recall that Gauntlet was added in 0.23, replacing Lab. When people complained (or not) that Gauntlet is too hard, a dev (advil) pointed to Sequell queries in the tournament.

To be clear: I do not claim that Sequell queries inform Crawl development exclusively. I am saying that Crawl development incorporates how people play (quite heavily). On the flip side, I see no evidence whatsoever that Crawl development (or that of any other game to my knowledge, not that my knowledge is very wide) uses this "hypothetical optimal player" very heavily. (Animate Skeleton and Spectral Weapon still exist. Trog gives berserk at 1*.) The devs do use it for some things, sure, but it's not an exclusive criterion. As I point out above, Crawl is not an interesting game if you look at "hypothetical optimal play", so such a quest would be pointless anyway.

Finally, several mentions of Hellcrawl occurred in the other thread. I have played a ton of games of Hellcrawl and I like it very much. However, most of the degenerate mechanics in Crawl are present in Hellcrawl as well, to the point that "hypothetical optimal play" would have a pretty large winrate. For example, luring exists very much in Hellcrawl. Things are hard at the start of the level but once you clear a little part of the level, you can lure as much as you want. The "doom clock" is generous enough to not matter most of the time.

Wednesday, 3rd July 2019, 09:32 by tealizard

Okay, so:

1. An argument about crawl from first principles would work from the definition of crawl, that is to say from the rules of the game, and from general principles about gameplay, usability, and so on.

2. It is true that with optimal play, difficulty quickly approaches zero, which sounds a lot like your conclusion in this thread. Situations that produce an unavoidable death/loss tend to happen in the entry vault of d:1 and almost vanish in probability by d:4. You can estimate these probabilities using vault weights, spawn weights, odds of various combat events, and so forth.

3. If you develop a novel perspective on the game and you have the means, you should write your own fork. The reason dcss has regular tournaments is that it helps promote the game and create a set of practices that define an online culture around the game. I do not agree that the linked comment lends any credence to the use of Sequell queries as a basis for reasoning about crawl.

Finally: It is absolutely true that hellcrawl maintains some of the worst deficiencies of dcss -- many of these are deep problems that require us to rethink crawl at a basic level. That is why we should move forward in our thinking about crawl, not relitigate issues settled by experience with hellcrawl or rehash arguments against things dcss culture gets right, like optimal play analysis. There is still a lot of potential in continuing the work hellcrawl began, but we are at a juncture requiring a greater leap forward than we have yet seen along the way, even more than the upstairs removal.
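The kind of estimate described in point 2 could be sketched like this: treat each floor as a weighted mix of scenarios, each with some probability of an unavoidable death, and multiply the survival chances across floors. Everything here is hypothetical. The weights and probabilities are placeholders, not real vault or spawn weights from the game:

```python
def unavoidable_death_probability(scenarios):
    """scenarios: list of (weight, p_unavoidable_death) pairs for one floor."""
    total_weight = sum(weight for weight, _ in scenarios)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * p for weight, p in scenarios) / total_weight


def optimal_winrate(per_floor_probabilities):
    """Chance of surviving every floor if only unavoidable deaths can kill."""
    survive = 1.0
    for p in per_floor_probabilities:
        survive *= 1.0 - p
    return survive


# Placeholder numbers: 90% of starts are harmless, 10% carry a 5% forced death.
d1 = unavoidable_death_probability([(90, 0.0), (10, 0.05)])
winrate = optimal_winrate([d1, 0.0])
```

The hard part is not the arithmetic but filling in the scenario list, which is exactly what the two sides of this argument disagree about.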

Wednesday, 3rd July 2019, 12:26 by bel

I do not understand what any of that really means. Yeah you can theoretically calculate difficulty based on vault weights, spawn weights, odds of various combat events etc. That's a bit like saying that you can model human behaviour by using quantum mechanics because everything is based on physics after all. I'll believe it when I actually see it. It is definitely possible, after all: AlphaZero learnt how to play expert Go by only starting from the rules of the game. However, I'm not aware of any CrawlZero.

None of the difficulty measures in this thread reach 0 at d:4, so whatever your difficulty curve is, it doesn't look like the ones in this thread. In fact, D:4 is the hardest floor in the game (according to the measures I calculated above). It is true that difficulty, overall, trends downward, but it does not drop down to zero anywhere near that fast.

The problem with a measure which implies that difficulty level is zero so early is that most of the game becomes completely irrelevant. It also implies something else: you cannot use "optimal play", "tedium" etc. to remove bad mechanics in most of the game. So, suppose the problem with LRD breaking walls was that it provided infinite digging, and thus it was removed because it is a degenerate mechanic. But if the "optimal winrate" was already 100% before learning LRD, then infinite digging makes no difference at all. So the professed reasoning is completely incoherent if we're working in the "hypothetical optimal" model. This is the main reason why most invocations of "optimal play" on Tavern are completely wrong.

Wednesday, 3rd July 2019, 12:57 by byrel

Going back to the original graph. The problem I always have with the 'lethality by XL' analysis is that XP aptitudes exist. Game progression (and hence difficulty) scale with XP-gained far more than they do with XL. And that makes the random distribution of XP apt blur out features on the difficulty graph.

Wednesday, 3rd July 2019, 13:20 by bel

It's definitely possible. XL is a rather coarse measure -- using XP would be better. Unfortunately, there is no fine-grained measure of XP which I can get from Sequell data. At least, I don't know how to do it, maybe some Sequell wizards can figure it out. Many things in-game depend on XL, so it's not a bad criterion to use.

The coarseness of XL is one reason why I tried out plots using various other criteria.

Wednesday, 3rd July 2019, 13:42 by bel

The comment about AlphaZero above brings to my mind another observation. AlphaZero also learnt how to play StarCraft expertly. I am not a StarCraft player, but from what I understand, AlphaZero just spammed a cheap and fast unit and then used superior micro to take down human players.

This makes me think that in complicated and sprawling video games (like Crawl is, and probably will always remain), there's almost always some broken mechanic somewhere. Optimal play will consist of exploiting the hell out of the broken mechanic.

Wednesday, 3rd July 2019, 13:52 by byrel

bel wrote:It's definitely possible. XL is a rather coarse measure -- using XP would be better. Unfortunately, there is no fine-grained measure of XP which I can get from Sequell data. At least, I don't know how to do it, maybe some Sequell wizards can figure it out. Many things in-game depend on XL, so it's not a bad criterion to use.

The coarseness of XL is one reason why I tried out plots using various other criteria.

Wouldn't it be possible to query based on race, and correct by their XP aptitude?

Wednesday, 3rd July 2019, 13:55 by bel

It's possible, but it would be a lot of work. I might do a little bit of it anyway, because I want to see how some races like Nagas and Felids do in the early game. My impression is that Nagas need to be buffed in the early game.
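One way to do the per-race correction is to pool species that share an XP aptitude and compute the difficulty curve separately for each group, so that fast and slow levellers are not averaged together. A minimal sketch, assuming per-species count tables are already available. The species labels, aptitude values and counts below are placeholders, not real game data:

```python
from collections import defaultdict


def difficulty_by_aptitude(counts, xp_apt):
    """counts: {species: {xl: (deaths_at_xl, reached_xl)}}; xp_apt: {species: apt}.

    Returns {apt: {xl: difficulty}} with counts pooled per aptitude value.
    """
    pooled = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for species, table in counts.items():
        apt = xp_apt[species]
        for xl, (deaths, reached) in table.items():
            pooled[apt][xl][0] += deaths
            pooled[apt][xl][1] += reached
    return {
        apt: {xl: d / r for xl, (d, r) in sorted(table.items()) if r > 0}
        for apt, table in pooled.items()
    }


# Placeholder species and numbers, for illustration only.
counts = {"A": {1: (10, 100)}, "B": {1: (30, 100)}, "C": {1: (5, 50)}}
xp_apt = {"A": 0, "B": 0, "C": -1}
curves = difficulty_by_aptitude(counts, xp_apt)
```

This needs one set of queries per species, which is where the "lot of work" comes from.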

Wednesday, 3rd July 2019, 14:37 by tealizard

Crawl is not complex enough for your analogy with the relationship between basic physical law and human behavior to hold water, I think. It is certainly true that it's hard to perform explicit calculations, which is why I talk about estimates instead. The reason your results do not match the picture I describe exactly is that they include a lot of noise and nonsense, which goes back to why sequell queries are not a good basis for thinking about crawl.

I actually agree about optimal play being an incomplete guide for exactly the reason that most things make no difference from that perspective. "Optimal play" is not a unique thing, just anything that reaches some theoretical maximum winrate, but reasoning based on the rules of the game is broader than optimal play analysis. A lot of analysis that people say deals with optimal play really deals with some kind of relative advantage versus naive or "normal" play and that advantage is sometimes marginal. Most of this talk is really just about the connection between what's possible according to the rules and best practices for winrate play. There's an admirable perfectionism to this kind of thinking. If anything, it is not applied scrupulously enough.

Wednesday, 3rd July 2019, 18:13 by bel

To compare crawl with another game I play (Brogue), I looked at the difficulty by dungeon depth in Brogue. In Brogue, you have to descend to D:26 to find the amulet of yendor and come back. There is an "extended" (below D:26), but I ignored that part for our purposes. About 13% of games either went into extended or won.

This is generally in line with my Brogue experience that difficulty level remains fairly high and roughly constant throughout the game.


Saturday, 6th July 2019, 10:07 by bel

tealizard wrote:Crawl is not complex enough for your analogy with the relationship between basic physical law and human behavior to hold water, I think. It is certainly true that it's hard to perform explicit calculations, which is why I talk about estimates instead. The reason your results do not match the picture I describe exactly is that they include a lot of noise and nonsense, which goes back to why sequell queries are not a good basis for thinking about crawl.

I actually agree about optimal play being an incomplete guide for exactly the reason that most things make no difference from that perspective. "Optimal play" is not a unique thing, just anything that reaches some theoretical maximum winrate, but reasoning based on the rules of the game is broader than optimal play analysis. A lot of analysis that people say deals with optimal play really deals with some kind of relative advantage versus naive or "normal" play and that advantage is sometimes marginal. Most of this talk is really just about the connection between what's possible according to the rules and best practices for winrate play. There's an admirable perfectionism to this kind of thinking. If anything, it is not applied scrupulously enough.

The problem is that I do not see any "estimates" here either. All I see is vigorous handwaving. If you say that the difficulty craters at D:3 (or D:4 or whatever), how did you derive this "estimate" from "first principles"? As far as I can see, this "estimate" is just pulled out from some random bodily orifice.

I am not against argument from the rules of the game; indeed, I do plenty of that sort of thing myself. However, to claim that it is the One True Path to Crawl Enlightenment is nonsense. Crawl is a game, not a mathematical puzzle. This is good, because Crawl makes a pretty good game, but a shitty mathematical puzzle. If anyone actually took "optimal play" seriously, they would advocate having the Orb of Zot on D:4. Or maybe D:2.

Given that Crawl is a game, looking at how people play is a completely valid way of looking at game difficulty. It is tractable and provides real estimates instead of vigorous handwaving. Moreover, you can look at subsets of players (for instance, I looked at "all players" and "tenpercenters" above) to have an idea of how the game feels for various kinds of players.

Saturday, 6th July 2019, 15:55 by tealizard

Your methodology, at best, measures something about dcss players, not dcss itself. I'm hearing about handwaving from someone who purports to measure the difficulty of a game by looking at games from people who barely know how to play or who aren't really trying. What you are doing in asserting the validity of this silliness is worse than handwaving.

Estimating "difficulty" in the sense you are talking about would involve probabilities of "unavoidable death" scenarios. You can find examples of those kinds of calculations on this forum. This is the sort of thing you should try thinking through yourself because it's moderately complicated and very tedious to go through on a forum. The point is that when you get a reasonable set of combat capabilities (skills, spells, items) and hp, you have so many options it's hard to contrive a situation that will kill you with significant probability given best tactics -- best practices make a lot of these situations almost impossible in the first place.

Thursday, 11th July 2019, 00:55 by bel

I found this cool series of posts which was done by some guy named Colin Morris a couple of years ago.

I half-remembered this from a post on Tavern back then. I had some worries about the approach.

Sunday, 14th July 2019, 22:39 by bel

Hellcrawl difficulty level in Dungeon. Details are under spoiler tag.

Only looks at games on CPO. Thanks to chequers for logfile help.


It looks like Hellcrawl shares the difficulty drop of DCSS up to D:8 or so. However, the removal of Lair gives an upward push to the difficulty in the rest of the Dungeon.

Monday, 15th July 2019, 00:24 by tealizard

hellcrawl is so fuckin' good

Monday, 15th July 2019, 04:09 by bel

Updated Hellcrawl graph. Since Hellcrawl is more-or-less linear till Vaults:3, I plotted the difficulty assuming the canonical order (D -> Orc -> Sbranch -> Vaults).


Thursday, 18th July 2019, 14:57 by bel

Nethack 3.6.2 difficulty plot. The turns are on a log scale


Thursday, 18th July 2019, 22:34 by Implojin

bel wrote:Nethack 3.6.2 difficulty plot. The turns are on a log scale

This is a beautifully concise summary of why I don't like nethack.

Thursday, 18th July 2019, 23:34 by bel

To be fair, the nethack plot doesn't look much different from the DCSS plots here.

Friday, 19th July 2019, 00:11 by Implojin

Wednesday, 24th July 2019, 16:00 by TheMeInTeam

bel wrote:The comment about AlphaZero above brings to my mind another observation. AlphaZero also learnt how to play StarCraft expertly. I am not a StarCraft player, but from what I understand, AlphaZero just spammed a cheap and fast unit and then used superior micro to take down human players.

This makes me think that in complicated and sprawling video games (like Crawl is, and probably will always remain), there's almost always some broken mechanic somewhere. Optimal play will consist of exploiting the hell out of the broken mechanic.

I watched those games vs the two pro players, and during Wings of Liberty I was a Diamond 1 player myself (not elite, but very good at the time). Alphazero was significantly more advanced than that at the strategic level, I guess the machine learning picked up on enough reactions during its training. It's true that it used inhuman unit control in a way that effectively distorted unit balance though.

Anyway, I'm not convinced "theoretical optimal" is a valid basis for evaluating difficulty. There are levels designed in Mario Maker or even Chicken Horse that are too difficult for anybody on this forum to beat...yet "optimal" play (such as what you'd get out of TAS) would beat them every single time. This is demonstrated to an extreme with hack-specific levels of Super Mario World like "item abuse TAS". Those are STILL 100% winrate under "optimal play", but no human alive today can beat them without tool-assists.

Using player data and death rates is a superior method of evaluating difficulty, assuming we're evaluating difficulty for human beings. You do run into a noise factor with literally impossible positions, however, due to RNG or bad design (some matches in Chicken Horse create temporarily impossible levels until people deploy obstacle destruction items or create a new path - it's not fair to call these "difficult" because there is literally no outcome variance between elite play and true newbies). Same goes for "true" RNG deaths in crawl - those rare events are not an example of difficulty, they are a noise factor complicating an evaluation of difficulty.

Wednesday, 24th July 2019, 17:04 by Siegurt

TheMeInTeam wrote:Using player data and death rates is a superior method of evaluating difficulty, assuming we're evaluating difficulty for human beings. You do run into a noise factor with literally impossible positions, however, due to RNG or bad design [....] Same goes for "true" RNG deaths in crawl - those rare events are not an example of difficulty, they are a noise factor complicating an evaluation of difficulty.

The general consensus among experienced players on the forums has been that the actual instances of RNG-produced unwinnable situations (where there were no actions you could have possibly taken to avoid dying, or getting into a situation where you would have died) are exceedingly rare, to the point where the noise they produce isn't going to be a significant factor in any evaluation with a robust data set.

The problem isn't that win rates using players' best efforts to win aren't reflective of difficulty, but rather that the signal to noise ratio in the dataset is too low to derive actual difficulty from it. There's flat-out no way to say "this game they weren't really specifically trying their hardest to win".

The much larger problem is that a very large amount of the time people get into situations where they could avoid death, have some idea of what it might take to do so, and choose not to because it would be flat out more trouble than dying and starting a new game. I don't know if death by "Eh I might be able to escape this horrible situation if I burn all these consumables and subsequently try to sneak past this too-hard combat by trying enough times and waiting and shouting in enough locations, but I also might beat it with a little luck. F-it, let's give it a try, worst case I lose 10 minutes" can really be considered "difficulty".

My sense is that there's a very large number of deaths which fall into the category of "I don't feel like expending the effort on what I know is probably more likely to keep me alive, when there's a reasonable chance I'll survive with much much less effort". There's also a not-insignificant number of cases where someone is specifically trying to lower turncount, which may increase risk and possible maximum score at the expense of win rate. However, there's no way of separating those from the deaths which are caused by a true lack of skill.

If that hypothesis is true, it makes any conclusions about difficulty derived from the data we have available to us fairly suspect.

It's the same reason that we exclude boring games from the results with !boring (early escapes without the orb, or quits): the point at which the game ends isn't reflective of the best effort of the player to win. There's just a much larger subset of games than can be readily identified which don't reflect players' best efforts to win.

Wednesday, 24th July 2019, 18:03 by bel

Zerothly, I should say that I primarily made these plots for fun. If they lead to some useful insights, that's a bonus.

That said, we should keep in mind the adage that "all models are false, some are useful". So, what is it that I'm trying to do here? I'm trying to look at places where Crawl's difficulty curve is out of whack with the "ideal" difficulty curve -- whatever that might be.

If we keep this goal in mind, the "optimal play" difficulty measure may or may not be true, but it's completely useless -- because it says that everything after D:2 (or D:3) is completely irrelevant, because "optimal winrate" quickly reaches 100% (or close enough to not matter).

Using Sequell data is just another way of "playtesting" the game and seeing how hard it is. Instead of a few people playing the game and offering feedback -- saying this branch is too hard, the AC on this monster could be reduced, and so on -- I am looking at a bigger data set. Is the data set dirty and full of noise? Obviously. But there is some detectable signal in the noise.
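One way to judge whether a feature in these curves is signal rather than sampling noise is to put an interval around each death rate. The sketch below uses the Wilson score interval for a binomial proportion. It only addresses sampling noise from small counts, not the "player wasn't really trying" kind of noise raised earlier in the thread. The counts in the example are made up:

```python
import math


def wilson_interval(deaths, reached, z=1.96):
    """Approximate 95% Wilson score interval for P(death | reached)."""
    if reached <= 0:
        raise ValueError("reached must be positive")
    p = deaths / reached
    denom = 1 + z * z / reached
    centre = (p + z * z / (2 * reached)) / denom
    half = z * math.sqrt(
        p * (1 - p) / reached + z * z / (4 * reached * reached)
    ) / denom
    return centre - half, centre + half


# Same observed rate (20%), different sample sizes: hypothetical counts.
small = wilson_interval(20, 100)
large = wilson_interval(2000, 10000)
```

The interval for the larger sample is much narrower, which is the sense in which a big, dirty data set can still show a detectable signal.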

I looked back at the Hellcrawl thread in CYC. Suppose we look at the point where the monster spawns were being tuned. What I see is people playing the game and offering their thoughts on whether X is too easy, Y is too hard, this branch needs cuts etc. If people worked on the basis of "upstairs exist, consumables exist, so the game is completely trivial after D:2, so there's no point in tuning monster spawns", then nothing would be done.

To again clarify: I am not opposed to arguments from the rules of the game etc. I do plenty of that sort of thing myself. I only claim that this kind of argument is not the only way to analyze game mechanics. Other data, appropriately used, can be useful -- indeed, much more useful in many cases.

Wednesday, 24th July 2019, 18:21 by TheMeInTeam

Siegurt wrote:The general consensus among experienced players on the forums has been that the actual instances of RNG-produced unwinnable situations (where there were no actions you could have possibly taken to avoid dying, or getting into a situation where you would have died) are exceedingly rare, to the point where the noise they produce isn't going to be a significant factor in any evaluation with a robust data set.

That might be broadly true when you take winrates in crawl as a whole, but remember that in this thread some of the discussion is centering on the exact portion of the game where RNG deaths are (by far) most likely to happen: the first few levels of dungeon (D:1 in particular). When you focus the scope so narrowly on this area you have less opportunity for it to wash out.

More dangerously, I've not seen conclusive data or even a consistent means of evaluating "RNG death" vs "death to obvious mistakes" vs "death to choices made that a crawl-equivalent of alphazero might have avoided". I do know broadly, as a somewhat experienced player myself now, that very few games are unwinnable, but reject that anybody w/o substantive, non-anecdotal/recency biased data should have *confidence* in making estimates.

Siegurt wrote:The problem isn't that win rates using players' best efforts to win aren't reflective of difficulty, but rather that the signal to noise ratio in the dataset is too low to derive actual difficulty from it. There's flat-out no way to say "this game they weren't really specifically trying their hardest to win".

We have roughly as much evidence supporting this as we have when players want to blame their losses on RNG though. Our "noise" filters are, at best, very crude.

Is it really unfair to claim that the vast majority of players that begin a game of crawl attempt to win it? Is there a non-arbitrary way to parse a "bad player mistake that gets them killed" from an "80% winrate player mistake that nevertheless still gets them killed when alphaDCSS wouldn't have died"? I don't see a clear reason for the distinction, right now.

Siegurt wrote:I don't know if death by "Eh I might be able to escape this horrible situation if I burn all these consumables and subsequently try to sneak past this too-hard combat by trying enough times and waiting and shouting in enough locations, but I also might beat it with a little luck. F-it, let's give it a try, worst case I lose 10 minutes" can really be considered "difficulty".

It's actually non-trivial to define difficulty in a way that would not count such scenarios but still consistently count things you would consider "real difficulty". Quoted scenario is not unlike a failed early rush attempt in a TBS/RTS for example, complete with the benefit if your gamble pays off (more XP and more consumables still available early should you succeed, but lower expected success rate on average compared to alternative strategies...also lower IRL time investment).

Siegurt wrote:My sense is that there's a very large number of deaths which fall into the category of "I don't feel like expending the effort on what I know is probably more likely to keep me alive, when there's a reasonable chance I'll survive with much much less effort". There's also a not-insignificant number of cases where someone is specifically trying to lower turncount, which may increase risk and possible maximum score at the expense of win rate. However, there's no way of separating those from the deaths which are caused by a true lack of skill.

If that hypothesis is true, it makes any conclusions about difficulty derived from the data we have available to us fairly suspect.

Yet earlier you claimed that other non-difficulty outcome sources wash out over large sampling. I don't see why this isn't true for the occasional "not trying" outcomes also. Especially when you can't actually estimate % in either case.

Importantly, we can reasonably hold these deaths to be equally likely for most changes made to early/late game crawl, and thus make meaningful conclusions about player death rate data changes based on mechanical changes in spite of them. Though reducing optimized play = excessive tedium isn't a bad side goal/benefit.

Siegurt wrote:It's the same reason that we exclude boring games from the results with !boring (early escapes without the orb, or quits): the point at which the game ends isn't reflective of the best effort of the player to win. There's just a much larger subset of games than can be readily identified which don't reflect players' best efforts to win.

I think you are vastly underestimating what "best effort to win" really means. If we're being honest you could throw out nearly the entire sample set of games. For example most people don't study the code, fsim weapon damage, look up every monster in every encounter until they know all stats/hd/etc by heart, and ensure their body is well rested and definitely not on any substances like alcohol with long calculations of all potential actions for every non-trivial encounter without exception.

Since they aren't doing those things and more, they're technically not giving their "best effort to win" and can be safely not counted, right? Except that kind of position neglects how the overwhelming majority of crawl players (including many players with very high winrates!) actually play crawl. The moment someone presses "autoexplore" once, they should instantly be removed from consideration? I'm not convinced. But this is implied if we're really talking about "best effort to win".

Saturday, 27th July 2019, 03:42 by Shard1697

Not directed at one specific person-if you mean "difficult" to only mean "difficult for the best players", you're talking about something which is not what most people think of when they hear the word "difficult".

Normally when people say something is difficult they mean it is hard for an average player, and then things that are hard for a skilled player are REALLY difficult.
Even compared to places where people discuss other RLs, fighting games, shmups etc, I don't feel like people speak past each other in this way like they do here. The average player matters if you're wondering what is difficult. Situations that kill the average player are difficult, and if you don't think so you're really asking a different (more specific) question than "is this difficult".

Saturday, 3rd August 2019, 18:52 by bel

Back to DCSS.

How much harder is the second rune S-branch as compared to the first rune S-branch? By my rough calculation, the second rune branch is about twice as easy as the first one.

Spoiler: show

Sunday, 4th August 2019, 08:33 by duvessa

bel wrote:Then I see how many people die in the first S-branch without finding a rune.
Code:
!lg * !boring 0.22 br=(Snake|Shoals|Spider|Swamp) urune<1 x=count(gid)
<Sequell> 4821 games for * (!boring 0.22 br=(Snake|Shoals|Spider|Swamp) urune<1): count(game_key)=4821

That isn't what this query gives you. It gives you how many people died in either S-branch while having no runes. Players often do levels 1-3 (and part of 4) of an S branch without getting the rune, then do levels 1-3 of the other S branch - this query catches deaths that happened in the second S branch as well. You're taking a bunch of the second S-branch deaths and counting them as first S-branch deaths.
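To make the miscount concrete, here is a minimal Python sketch. The death records are made up, and the `first_s` field (which S-branch the character entered first) is hypothetical: the logfile has no such field, which is exactly why the query can't tell the two cases apart.

```python
# Made-up death records, purely for illustration (not Sequell data).
# "first_s" is a hypothetical field: the S-branch the character entered first.
S_BRANCHES = {"Snake", "Shoals", "Spider", "Swamp"}

deaths = [
    {"first_s": "Snake", "br": "Snake", "urune": 0},   # died in the first S-branch
    {"first_s": "Snake", "br": "Shoals", "urune": 0},  # left Snake runeless, died in the second
    {"first_s": "Spider", "br": "Swamp", "urune": 1},  # died in the second S-branch with a rune
    {"first_s": "Swamp", "br": "Swamp", "urune": 0},   # died in the first S-branch
]


def matches_query(death):
    """Mirror the filter br=(Snake|Shoals|Spider|Swamp) urune<1."""
    return death["br"] in S_BRANCHES and death["urune"] < 1


def died_in_first_s_branch(death):
    """True only if the death happened in the branch entered first."""
    return death["br"] == death["first_s"]


matched = [d for d in deaths if matches_query(d)]
miscounted = [d for d in matched if not died_in_first_s_branch(d)]

print(len(matched), "deaths match the query")
print(len(miscounted), "of those actually happened in the second S-branch")
```

The second record is the problem case: it matches the query but is a second S-branch death.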

(There's also a survivorship bias issue here; players that get the rune from the first S-branch are likely to be better than the players that don't, so the "win" rate for the second S-branch is going to be higher even if there's no difficulty difference.)
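A rough sketch of the survivorship effect, in Python. The two skill groups and their survival probabilities are invented numbers, and each group is assumed to have the same survival chance in both branches (i.e. no difficulty difference at all between the first and second S-branch).

```python
def branch_survival(groups):
    """groups: list of (number of players, per-branch survival probability).

    Returns (survival rate in first branch, survival rate in second branch),
    assuming only survivors of the first branch attempt the second one and
    each player's survival chance is identical in both branches.
    """
    entrants_first = sum(n for n, p in groups)
    survivors_first = sum(n * p for n, p in groups)
    survivors_second = sum(n * p * p for n, p in groups)
    return survivors_first / entrants_first, survivors_second / survivors_first


# Invented example: 500 strong players (90% survival), 500 weak ones (50%).
first, second = branch_survival([(500, 0.9), (500, 0.5)])
print(f"first S-branch survival:  {first:.3f}")
print(f"second S-branch survival: {second:.3f}")
```

With these invented numbers the second branch looks easier even though nothing about it is, because the weak players have been filtered out by the first one.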

Sunday, 4th August 2019, 09:48 by bel

Yes, both of the points are true.

As I said, I made some approximations in my calculations. In particular, I assumed that whenever a person enters an S-branch, they do it till the end (or till they die). This is sometimes not true because branch ends are much tougher than the rest of the branch.

However, because DCSS is highly non-linear at the time of the rune branches, I don't know of any easier way to perform the calculations. Someone could just as well decide to do Elf in between, or Depths or whatever.

I suppose I could look at milestone data within a game, but my Sequell skills aren't that good.

Another way could be to just look at deaths on Sbranch:$ (with and without a rune), because most of the action happens there. However, that has other problems: the first few floors of the first S-branch are (probably) more difficult than the first few floors of the second S-branch, so ignoring both of them doesn't seem right.

Another way could be to incorporate the XP of the character in the deaths somehow -- the XP of a character doing their second rune would be higher. But higher XP characters die less anyway because DCSS becomes easier as the game goes on.

As for survivorship bias, well that is always present, in all of the queries in this thread. It's probably bigger here than in many other queries because most people who have played DCSS have never got a rune.

Monday, 5th August 2019, 17:59 by TheMeInTeam

Another way could be to just look at deaths on Sbranch:$ (with and without a rune), because most of the action happens there. However, that has other problems: the first few floors of the first S-branch are (probably) more difficult than the first few floors of the second S-branch, so ignoring both of them doesn't seem right.

I'm not sure we can make this conclusion so easily. People frequently select S branch order based on what they anticipate will be the easier one for their build to do first. In extreme cases like a mummy drawing spider as an S branch, the 2nd branch is so much harder/more dangerous than the first that you skip it in favor of elf/vaults/depths first. Frail species w/o a shield or Dmsl might feel the same way about shoals. This will of course make dying before 2nd rune technically more likely (3 areas compared to 1), but the build that enters the branch for its 2nd rune will be much stronger than a typical 2nd S branch.
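The "3 areas compared to 1" point can be sketched as simple compounding of risk. This is a Python sketch with a made-up per-area death probability, and it assumes the areas are independent, which real games certainly are not (the character gets stronger along the way).

```python
from math import prod


def death_chance(per_area_death_probs):
    """Chance of dying somewhere along a route, assuming independent areas."""
    return 1.0 - prod(1.0 - p for p in per_area_death_probs)


# Invented number: 10% chance of dying in each area.
one_area = death_chance([0.10])
three_areas = death_chance([0.10, 0.10, 0.10])
print(f"one area:    {one_area:.3f}")
print(f"three areas: {three_areas:.3f}")
```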

There's a lot of noise for this query in particular. Enough that I expect it's less predictive than ignoring it and just using XL.

Monday, 5th August 2019, 19:33 by bel

Shoals might be an outlier because it's often harder than Vaults: 1-4. The other three are of roughly similar difficulty imo.

I wanted to develop some kind of quantitative measure for my general feeling that "there should be only one S-branch in the game because the first one is interesting while the second one is a slog."

After running the queries, I'm not so sure about my feeling. By the rough measure above, about 12% of attempts for the 2nd rune in the S-branches end in deaths; so it's a bit of a stretch to call it a "slog". Still, I think it would be good overall if there was only one S-branch.

Tuesday, 6th August 2019, 08:54 by Utis

I cannot contribute to the main point of this discussion (and I hope I'm not derailing anything). But the following might be tangential: I do not think that a flat difficulty curve contributes to keeping an RPG game interesting.

I think this is generally the dilemma of RPGs as opposed to, say, adventure games or strategic board games: On the one hand, "character improvement" is their very core and their strategic point of interest. On the other hand that "character progression" essentially goes like this: At level 1, you do 1 point of damage per round to a goblin with 10 HP. At level 100 you do 100 damage per round to a dragon with 1000 HP.

Since, personally, I'm unable to suspend my disbelief in that illusion, I really have difficulty continuing to play any RPG with any sort of attention in the long run. Most games, including crawl, deal with this dilemma by making things more complicated as the game continues. That means that gaining accurate knowledge of game mechanics and their interactions becomes the actual objective in the long run, far more than play itself. I don't know if there is a better solution. But I do think that if you keep difficulty constant, you're dealing with a game design paradox.

Tuesday, 6th August 2019, 15:40 by TheMeInTeam

bel wrote:Shoals might be an outlier because it's often harder than Vaults: 1-4. The other three are of roughly similar difficulty imo.

I wanted to develop some kind of quantitative measure for my general feeling that "there should be only one S-branch in the game because the first one is interesting while the second one is a slog."

After running the queries, I'm not so sure about my feeling. By the rough measure above, about 12% of attempts for the 2nd rune in the S-branches end in deaths; so it's a bit of a stretch to call it a "slog". Still, I think it would be good overall if there was only one S-branch.

Players can skip an S branch if they want. I've actually done this recently as I mentioned in another thread (skipped spider as a mummy).

I'm not convinced shoals is more dangerous than other S branches. Your game data is aggregated across versions, correct? (edit: seems not, just for .22, but the point is similar then). Swamp is considerably more dangerous now than it used to be, with long range smite-targeted grasping roots and dangerous projectiles. It has only slightly less stair pull/push than shoals too.

I suspect shoals might have more difference in survival between experienced and inexperienced players compared to other branches. This is because there are many extra tricks in shoals:

- Wand of flame on many things to damage with steam + block LoS
- Hexing wands or hex spells tend to be very effective on many of the enemies, including threatening ones
- Very few things in shoals see invis, a trait shared with spider but not swamp/snake
- Very few things in shoals have rPois, so magic like OTR or mephitic cloud can trivialize many of its encounters if you happen not to have hexes

In addition to this, snake/shoals tend to have shops and shoals in particular has the best loot drops of the S branches, so it power spikes the player more. However, newer players taking their MiBe there the first time might not anticipate the dangers of barbs, mesmerize, getting blown off stairs, getting stairs flooded, or getting pin-cushioned by mighted ranged enemies...so they die with otherwise strong setups because they don't know the above.

Contrast this to alternative S branches. You can fly over deep water to avoid grasping roots and still use wand of flame in swamps similar to shoals...but there's not a lot of extras there. In snake the predominant strategy is to walk away, mostly complicated by guardian serpents. Spider is mostly fast melee stuff, so once you learn to respect tarantellas and not to chase orb spiders deep into the fog your only real consideration is ghost moths.

Once you learn the extra variety in Shoals, I don't think it's more dangerous than Swamp. But it does have a larger learning curve initially and this might skew its results, further complicated by shambling mangroves/thorn hunters being more dangerous in swamp now than previously...especially the mangroves.

IMO Vaults 1-4 would only have a lower death rate than shoals in a vacuum. People go into shoals at significantly lower levels.

Friday, 24th January 2020, 03:04 by bel

I recently read some things which made me think about this thread. I'll briefly outline the articles and say how they connect to this thread.

The first is this article: How Slay the Spire's devs use data to balance their roguelike deck-builder. Slay the Spire is a hybrid roguelike/deck-building game where you choose cards to fight against enemies. Since there are too many interactions to try to analyze, the devs looked at the game logs to determine balance. For instance, one simple measure they used is to see how often a player chose a card when they were given the option to choose it; and then the devs connected this choice to whether the player won or not. If a particular card is chosen by most winning players, it's likely to be overpowered.
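A minimal Python sketch of that kind of measure. The log format and the card names are hypothetical, and this is only my guess at the general idea, not the devs' actual pipeline: for each card, compare how often it is picked overall with how often it is picked in runs that ended in a win.

```python
from collections import defaultdict


def card_pick_rates(offers):
    """offers: iterable of (card, was_picked, run_was_won), one row per offer.

    Returns {card: (pick rate over all offers, pick rate within winning runs)}.
    The second value is None if the card was never offered in a winning run.
    """
    # Per card: [offered, picked, offered in wins, picked in wins]
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for card, was_picked, run_was_won in offers:
        t = totals[card]
        t[0] += 1
        t[1] += int(was_picked)
        if run_was_won:
            t[2] += 1
            t[3] += int(was_picked)
    return {
        card: (picked / offered, (picked_w / offered_w) if offered_w else None)
        for card, (offered, picked, offered_w, picked_w) in totals.items()
    }


# Invented log rows with placeholder card names "A" and "B".
offers = [
    ("A", True, True), ("A", True, True), ("A", False, False), ("A", True, False),
    ("B", False, True), ("B", True, False), ("B", False, True), ("B", False, False),
]
rates = card_pick_rates(offers)
for card, (overall, in_wins) in sorted(rates.items()):
    print(card, overall, in_wins)
```

A card whose pick rate in winning runs is well above its overall pick rate would be a candidate for a closer look.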

From the perspective of the players themselves, something similar was created by one of the top players in the world (Jorbs). He created a Google Doc with the cards chosen from his runs, using a similar methodology. He assigned each card an Elo rating, with a higher rating meaning that he values the card more.
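For anyone unfamiliar with Elo, here is a minimal Python sketch of how card ratings could be updated from picks. The standard Elo formulas are used, but treating a pick as a "win" over each skipped card, the K-factor of 16 and the starting rating of 1500 are all my assumptions; I don't know the exact method behind the Google Doc.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_after_pick(ratings, picked, skipped, k=16.0, default=1500.0):
    """Treat the picked card as having beaten every skipped card (assumption).

    All changes are computed from the ratings before the update, then applied.
    """
    deltas = defaultdict(float)
    r_picked = ratings.get(picked, default)
    for other in skipped:
        r_other = ratings.get(other, default)
        gain = k * (1.0 - expected_score(r_picked, r_other))
        deltas[picked] += gain
        deltas[other] -= gain
    for card, delta in deltas.items():
        ratings[card] = ratings.get(card, default) + delta
    return ratings


# Placeholder card names; all cards start at the default rating.
ratings = update_after_pick({}, picked="X", skipped=["Y", "Z"])
print(ratings)
```

Repeating this over many card rewards would push frequently picked cards up and frequently skipped cards down, with the total rating conserved.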

------------------------------

To connect it to what we were discussing above, let me quote my post above, where I contrasted two ways of looking at Crawl's difficulty level. One is arguments using the rules of the game, the other is looking at player data.

bel wrote:Zerothly, I should say that I primarily made these plots for fun. If they lead to some useful insights, that's a bonus.

That said, we should keep in mind the adage that "all models are false, some are useful". So, what is it that I'm trying to do here? I'm trying to look at places where Crawl's difficulty curve is out of whack with the "ideal" difficulty curve -- whatever that might be.

If we keep this goal in mind, the "optimal play" difficulty measure may or may not be true, but it's completely useless -- because it says that everything after D:2 (or D:3) is completely irrelevant, because "optimal winrate" quickly reaches 100% (or close enough to not matter).

Using Sequell data is just another way of "playtesting" the game and seeing how hard it is. Instead of a few people playing the game and offering feedback -- saying this branch is too hard, the AC on this monster could be reduced, and so on -- I am looking at a bigger data set. Is the data set dirty and full of noise? Obviously. But there is some detectable signal in the noise.

I looked back at the Hellcrawl thread in CYC. Suppose, we look at the point where the monster spawns were being tuned. What I see is people playing the game and offering their thoughts on whether X is too easy, Y is too hard, this branch needs cuts etc. If people worked on the basis of "upstairs exist, consumables exist, so the game is completely trivial after D:2, so there's no point in tuning monster spawns.", then nothing would be done.

To again clarify: I am not opposed to arguments from the rules of the game etc. I do plenty of that sort of thing myself. I only claim that this kind of argument is not the only way to analyze game mechanics. Other data, appropriately used, can be useful -- indeed, much more useful in many cases.

If you watch Jorbs's streams, you'll see that he uses a lot of reasoning using the rules of the game. But he (and the Slay the Spire devs) also do a lot of player-data based analysis. As I tried to argue above, this method is not illegitimate just because it cannot give the "true" or "optimal" difficulty level -- Crawl and Slay the Spire are games, not mathematical puzzles.

Friday, 24th January 2020, 21:08 by b0rsuk

Re: bel

Slay the Spire is a clone of a popular and influential board game Dominion. It started a new board gaming genre. What they use for balancing reminds me of neural networks. Another card-based board game, Race for the Galaxy, has a computer version with neural networks. You can download it for free. The way I understand neural networks work, it doesn't try to understand the game at all. Rather, the programmer set a bunch of "sensors" to watch for specific conditions and rate each card in each situation. So, for example, it may work like this: if I have This card in hand, and there are 9 victory points left, and the end of the game is 4 cards away, then playing This card gives 0.273443321 chance of winning. In the same situation, playing That card would be 0.27653462 chance of winning. Therefore, play That card.

The downside of neural network based AI, I believe, is that it can't justify its actions to save its life. It just plays those cards in a very specific situation because thousands of simulated plays have shown that this leads to victory with such and such probability. I made extensive use of the computer version of Race for the Galaxy when learning to play it. It's a brutal teacher and it's very effective, but you need to very carefully analyze its actions.

Even so, I've found that the implementation using neural networks has a bit of a weakness at strategic level (planning a couple of turns ahead). There are combos it consistently undervalues. I guess the programmer didn't catch all the factors that influence the outcome of the game, so neurons/sensors couldn't catch them.

Neural networks tend to work especially well in card games, because the data is fuzzy but states relatively well defined. Player might or might not have specific cards in hand, so exhaustive enumeration is prohibitively expensive, and time-limited algorithms can produce unsatisfactory results (Las Vegas / Monte Carlo, I don't remember which is which). Also in card games, available decisions and factors tend to be much better defined than in a game like Crawl, which has a great deal of strategic factor in skill training and spells, and which monster or item might or might not show up. I would be highly surprised if someone could define an effective neural network for DCSS. The complexity reminds me of 4X games, which are infamous for bad AI (and that's why they ultimately failed).

-----------

I think the way people play DCSS has a bit of herd mentality. Tavern, reddit, the wiki all suggest certain approaches which are not necessarily true or up to date. I used to play Crawl before draconians were cold-blooded. Nowadays the most common ranged enemy in Lair is a rime drake (frost), not firedrake. I very much doubt Lair is a great choice for a young draconian to train.

Swamp HAS become a lot harder compared to a couple of years ago. It has a distinct semi-open layout; the old Swamp was very open. On the plus side, it limits vision. The downside is that stealth might not be as good in there because fighting noise is going to attract stuff anyway (I recently won a DsFi with antennae, so I got to experience first-hand how much fighting noise actually does, and my stealth was two stars short of perfect). The same character reaped great benefits from Stealth in Shoals. Swamp is quite tricksy now, a spriggan druid can cast Might on a hydra. Old Swamp was infamous for requiring poison and perhaps electricity resistance. In new Swamp, I find that area damage/summoning is more useful because there are more choke points, and a flaming weapon because hydras are harder to spot from farther away.

The branch that has probably changed the least is Spider Nest (entropy weavers). Even Snake pit got naga sharpshooters and shock serpents.

---------

Also players tend to over-react to nerf changes. Nerfs have a big psychological effect in that many players seem to default to the second best option rather than actually trying the thing.

--------

My final 2 cents - judging by these forums, it seems Necromutation and especially Statue Form have become somewhat of an "ascension kit" in extended. I don't remember that being the case several years ago. A new meta has formed. Meta is not always right, but people tend to follow others when in doubt.

Monday, 27th January 2020, 21:08 by TheMeInTeam

Most people say necromutation is bad. Statue form has a divided opinion. Some people really like it in extended, others don't bother or can take or leave it.

Many recent nerfs have been:

a) removing the option completely on shaky/overly general justification (no way to "actually try the thing").
b) nerfs to already-questionable options on the stated/unsupported basis that they were too strong (confusing touch, agony, dispel undead). Dispel undead was then made less expensive to learn/use, but amusingly the main issue with it has barely changed.

Stuff like BVC was nerfed but remains a very effective spell. Stabbers are somewhat brought down despite that they were already on the weak side. It's a mixed bag, some of the other magic changes were good (VM is likely too complete with its starting book, IE is a challenge pick unless you want to just barely train it and use freeze until you transition to a weapon).

I'd be interested to see how alphaX would do with crawl and 4x games. I expect it could probably become very good at the former, eventually outperforming top players. The latter would be an issue with modern 4x because their performance is so *expletive* awful that it might take even our best computers ages to do say a million games to learn. You'd have to fix the game to quit animating crap off-screen (from player perspective) first for example. Even during a turn with animations turned off, I can somehow make inputs faster than the game can handle them, and run afoul of input buffering. Webtiles has an excuse (connection), 4x does not. That's pathetic. No way you can sim tons of games with an AI using a game in that sorry state. IMO that's why 4x has mostly failed more so than their complexity/AI.

Monday, 27th January 2020, 22:50 by Siegurt

b0rsuk wrote: The way I understand neural networks work, it doesn't try to understand the game at all. Rather, the programmer set a bunch of "sensors" to watch for specific conditions and rate each card in each situation.

This: https://www.youtube.com/watch?v=aircAruvnKk is a great 20 minute explanation of how neural networks work, additionally I'd highly recommend all the 3Blue1Brown series, it's very clear and digestible.

Monday, 20th July 2020, 21:06 by bel

Continuing my comparison with other roguelikes I play. Here I look at Slay the Spire.

This is a very rough attempt, with a lot of shortcuts. Details under spoiler tag.

The difficulty level ramps up in the beginning, then stays roughly constant through most of the game. The final boss fight is the hardest one, as one would expect.

Spoiler: show