Quarterly Cup Reflections
October 9, 2023 at 7:57 PM
I won the tournament! Would not have predicted that would happen… It ended up being a pretty strong victory, with about a 25% tournament take. This means that I win 25% of the $0 prize pool which, alas, only comes out to $0 if you do the math. Apparently I may get a Metaculus-branded hat though, and you really can’t put a price on that. And of course glory. You can never have enough of that.
I’m now attempting to defend my crown in the 2nd Quarterly Cup which is already underway. I’m not going to write about every question, but I may pick and choose a few to share thoughts on if I feel like it. Also, it’s not too late to join if you’re interested!
For now, here are 10 thoughts about forecasting, Metaculus, and the forecasting community that have been bouncing around in my head the last few weeks:
Forecasting is a great way to follow the news. When you’re reading the news to make a prediction on something, you read it in a very different way than a typical reader would. Instead of passively browsing, you actively seek out articles in order to find specific information. You formulate specific questions - “How long do Israeli court decisions typically take?” “How long do we expect this one to take?” - and then try to extract that information from news articles, which may answer it explicitly, or may just offer hints.
I think this style of news-reading results in much better comprehension than a more passive, browsing style of reading. I’ve found that I’m much better informed about topics I’ve made predictions on compared to topics I’ve just read passively about in the news.
Writing about every forecast was difficult - but it’s also probably a big part of why I did so well. For the first 2/3 of the tournament, I tried to write something about my initial prediction on every question. I settled on a style where I’d spend a few paragraphs summarizing the question context so that a reader who wasn’t closely following the topic could understand the question, followed by a few paragraphs where I explained how I landed on my number.
There were lots of tournament questions, so this took a lot of time. I was usually just learning about each topic, and it took serious effort to summarize all my background reading into a few concise paragraphs. Also, it forced me to actually write down the specific reasons I landed on the number I did, which often were very vague, intuition-based reasons, rather than easy-to-justify baserates and statistics.
As painful as this process was, I think it was also really good for my predictions. After all, I think I’m the only tournament participant who did this and also the only tournament participant who won, and those things may be related! Concretely, it ensured that on every question, I knew the context well enough to explain it fairly concisely, and it forced me to understand which factors I was weighing, and how heavily, when I made the prediction. This is helpful in the moment, as it exposes things you were tricking yourself into thinking you understood, and is doubly useful later on, as you now have a historical database of how you approached different questions and what the result was.
Am I just a natural at this? If you decided to take up a random sport one day, joined a tournament hosted by one of the leading institutions behind that sport (without practicing), and then won that tournament, what would you conclude?
If it’s a very luck-based sport, you probably just got very lucky. But if not, I think you should conclude that both (a) you’re probably naturally pretty good at that sport, at least more than the average person, but also (b) you probably weren’t competing against that strong of a field. This is how I feel about my performance. While there was definitely luck involved, there were too many questions for it to really be about luck. But also, if I was really competing against a bunch of elite forecasters, I should not have won. For the first handful of questions, I didn’t really even know how the tournament scoring worked and pretty quickly plunged my score well into the negatives. I also completely mailed in a few questions, some of which resulted in very bad scores. So I should have been very beatable.
I think the best explanation here is: (a) forecasting is a very niche thing that not many people do, so the average field in any tournament will be pretty weak, (b) this specific tournament had no prize pool, so if you’re a very good forecaster, you may have just skipped it, and (c) I do have some attributes that make me especially good at forecasting - more on this below.
I would also add that I joined a different forecasting site, Manifold Markets, back in August, and in 3 months have turned the 500 starting ‘Mana’ you get when you sign up into 8500, having specifically made a point of not doing any research and just buying/selling based on intuition. Again, not sure what to conclude here, but it seems very possible that these sites are just full of people who are terrible at predicting things, such that it’s easy to do quite well by just being half-decent.
I’m particularly well suited to do well in forecasting tournaments - I’ve been “very online” for many years and am quite good at quickly tracking down information about questions. I’ve also read, with genuine interest, about such a wide variety of topics (geopolitics, tech, sports, science, political polling, etc.), that while I usually don’t know much about the specifics of a question, I have lots of ‘reference points’ of knowledge that help me orient myself when reading about it. In other words, I’m a generalist, with surface-level knowledge of a great many things, even if I lack in-depth knowledge in specific areas. The result of this is that when faced with a new question, I can usually understand it pretty well within an hour or two of reading.
I’m also the kind of person that enjoys tracking 15 different events as they unfold by meticulously doing filtered Google searches each day and keeping a library of bookmarks and Twitter accounts to check in on repeatedly. This is pretty important in a tournament like this one, where you can have questions about rapidly unfolding events, which require you to track down up-to-date information every day. I can completely understand how people would find this overwhelming, or just not enjoyable. But for whatever reason, I find it very fun. I was already sort of doing this just in a more passive way before I started forecasting, so it felt like a very natural transition from my prior online habits.
Different styles of questions, different styles of predictions - Different styles of questions lend themselves to very different prediction strategies. All of these strategies can do well for the right types of questions, and choosing the right strategy for each question is really important.
For some questions - “will the Dow Jones cross such and such threshold by this date?” - you should clearly just use a model based on some historical data and keep intuition out of it. I can think of several examples where the community was clearly letting vibes move them away from what the quantitative model would say, and where, by just sticking to a historical average, I scored well on the question.
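Just to illustrate the kind of “model” I mean (this is a generic sketch written for this post, with placeholder numbers and a hypothetical data loader, not anything from the actual question), the whole approach can be as simple as counting how often a comparable move has happened over comparable windows in the historical record:

```python
# Rough sketch of the "just use historical data" approach for a question like
# "will the index close at least X% above today's level within N trading days?".
# The data loader and the numbers below are placeholders, not from any real question.

def frequency_of_move(closes, pct_move, horizon):
    """Fraction of historical `horizon`-day windows in which the index
    gained at least `pct_move` percent over the window's starting close."""
    hits, windows = 0, 0
    for i in range(len(closes) - horizon):
        start = closes[i]
        best = max(closes[i + 1 : i + 1 + horizon])
        windows += 1
        if (best - start) / start * 100 >= pct_move:
            hits += 1
    return hits / windows if windows else None

# closes = load_historical_closes()          # hypothetical data loader
# print(frequency_of_move(closes, 3.0, 60))  # e.g. a ~3% rise within ~60 trading days
```

Whatever number that spits out is the forecast; vibes only come into it if there’s a concrete reason to think the current period is different from the historical record.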
For other questions - will India’s rover successfully land on the moon? - there is a very clear baserate you can start with, and then adjust from there based on the specifics. In recent decades, landers have about a 40% success rate. India has failed once, but got close, and has a competent space program. So a prediction of a bit over 50% is probably a reasonable guess.
Some questions didn’t lend themselves well to quantitative analysis, and lacked a clear baserate to start from - will the Black Sea grain deal be revived before July 16th? - this style of question relies much more on analyzing different news sources and trying to gain an intuitive sense for what will happen. This is an example of where I tried to be overly quantitative in my approach early on and latched on to a poor baserate, and ended up doing poorly. I may have been better off just reading a bunch of articles, thinking “does it seem like Putin cares about this deal?”, and going with the number that felt right.
To comment or not to comment… Tournament-style forecasting leads to strange commenting incentives, which probably reduce the quality of the community forecast. Forecasts are scored such that the important thing is doing well relative to the field, rather than doing well in an absolute sense. To get a really positive score on a question, you need to be significantly closer to the final resolution (e.g. 10% when the community is at 40% on a No-resolving question) for an extended period of time. This means that if you’re trying to win, the incentive is to not share important information that might give up your edge over the community.
There were a few times in the tournament where I realized that due to some hard-to-track-down piece of information that the community probably wasn’t considering, the real probability should have been <1% for a question, while the community was sitting at 20 or 30%. With a simple comment, I could have easily shifted the whole community to my number, thereby improving the performance of the site. But this would have eliminated any points that I would gain in the tournament from being more right than the community, so I didn’t comment.
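To put rough numbers on that incentive, here’s a toy relative log score. This is not Metaculus’s actual peer/tournament scoring formula, just an illustration, but the qualitative point holds for any score-relative-to-the-field rule:

```python
import math

# Toy relative log score: how much better my forecast does than the community's.
# NOT the exact Metaculus formula; just an illustration of the incentive.

def log_score(p, resolved_yes):
    return math.log(p if resolved_yes else 1 - p)

def relative_score(my_p, community_p, resolved_yes):
    return log_score(my_p, resolved_yes) - log_score(community_p, resolved_yes)

# I've found information suggesting the real probability is ~1%; the community sits at 30%.
print(relative_score(0.01, 0.30, resolved_yes=False))  # ~ +0.35: a nice edge over the field

# If I post a comment and the community moves to my number, the edge vanishes.
print(relative_score(0.01, 0.01, resolved_yes=False))  # 0.0: no relative gain at all
```

The site gets the better forecast either way; the only thing the comment changes is whether I get any credit for it.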
I think this is the biggest advantage true prediction markets like Manifold have over a site like Metaculus. In prediction markets, you could just put a huge amount of money on No to move the market and gain a big profit, or put as much as you’re willing to spend and then reveal your information so that the community follows. In other words, you can cash in on your special knowledge immediately, whereas on Metaculus you can only cash in over a long time period.
Who is this for? Is any of this actually useful? Most of the forecasts on Metaculus seem pretty useless, including most of the questions I was forecasting. As in, I’ll often read a question and think, “what specific person would be looking at this market to help decide whether to take one action vs. another?” and be unable to think of almost anyone. Lots of the questions are interesting, which I guess is a kind of usefulness in itself, but I’m not sure they’re useful enough that people would, say, pay lots of money for the information.
For some questions, I can think of people that might find them useful, but they’re people who would have so much better “insider-knowledge” about the question that the usefulness of the forecast is diminished. For example, I can think of people who might really care about what % chance there is that the Black Sea Grain Initiative will be renewed - leaders in affected countries, people in the grain industry, etc. But those parties are also likely to have much more direct knowledge about the state of the negotiations, to the point that their internal estimate should be better than my outside-observer estimate that’s based on news articles.
Unfortunately, one of the best cases I can think of - questions around the likelihood that LK-99 was a room temperature superconductor - was actually a mini-disaster in terms of prediction market credibility! I think you could reasonably argue that the wildly over-optimistic forecasts on these sites were the major driver behind the hysterical online coverage (I recall many prominent people on Twitter posting the Manifold estimate very early on as justification for why people should be freaking out more), which is to say they had the exact opposite effect that prediction-market proponents claim they are supposed to have!
This isn’t to say that I don’t think these sites can be useful, just that I think the vast majority of questions are phrased in a way that makes them useless, or are about topics where the relevant stakeholders who would care about the question are likely to have superior insider-knowledge.
I also would add that there is still some benefit to having lots of useless / unimportant questions, provided that many of them have short timelines - these questions are still great practice!
Metaculus scoring generally feels pretty fair. I strongly prefer it to the prediction market style system on Manifold. Metaculus rewards being right early more than being right late. If you change your prediction right at the last second, this has close to zero impact on your score. This reduces the need to constantly check for news related to every question - if breaking news means a question will resolve immediately, it’s too late for you to do anything about it. If it doesn’t cause an immediate resolution, you have a bit of time to adjust your forecast without a big penalty. On Manifold, you can be completely wrong for a year but check Twitter at the right time and end up wildly profitable on a question. Metaculus avoids this issue.
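A toy version of the time-averaging (again, my own sketch of the mechanism rather than the exact Metaculus formula) shows why a last-second correction barely moves your score:

```python
import math

# Treat a question's score as the average of daily log scores over its lifetime.
# (A simplification of Metaculus's time-averaged scoring, for illustration only.)

def avg_log_score(daily_forecasts, resolved_yes):
    scores = [math.log(p if resolved_yes else 1 - p) for p in daily_forecasts]
    return sum(scores) / len(scores)

# 90 days stuck at 70% on a question that resolves No...
stubborn = [0.70] * 90
# ...versus the same history, but dropping to 1% on the final day after breaking news.
last_second_fix = [0.70] * 89 + [0.01]

print(avg_log_score(stubborn, resolved_yes=False))         # ~ -1.20
print(avg_log_score(last_second_fix, resolved_yes=False))  # ~ -1.19: nearly identical
```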
I also like that they weight the “middle probabilities” more than the extremes - guessing 40% when the community is at 50% earns more points than guessing 30% when the community is at 40%, which is better than 20% vs 30%, and so on. I think this correctly leads to higher weight being placed on hard-to-predict questions, and discourages people from “racing” to 0 or 99.99% to try and get an edge on very low or very high %-chance questions. A side effect is that many questions end up not mattering much for the tournament as they spend the whole time near 0 or 100%, where there aren’t many points to be gained, but that seems fine.
One quibble I have is that continuous questions seem to be rather arbitrarily weighted much more strongly than yes/no style questions. Maybe I just don’t understand how they’re being scored or something, but continuous questions were generally much more consequential in terms of tournament score than the Yes/No questions, and it’s not obvious to me why this is something we would want. It’s particularly annoying because Metaculus also has a very limited UI for inputting continuous distributions (guys, some things have probabilities that aren’t bell curves!). So continuous questions are not only more important, but also impossible for me to input my “real” prediction on.
I’m pretty sure almost everyone is just following the community on almost every question. People seem really worried to deviate much from the community. With tournament scoring, I sort of understand this - you want to avoid a big negative score, so unless you’re super confident that the community is wrong, it makes sense to defer to the crowd judgment. However, it’s important to realize that if lots of people do this, the community average gets worse, and loses the thing that made it good in the first place - a bunch of different perspectives considering different things which average out to a good guess.
There were a bunch of questions in the tournament where people were clearly very uncertain initially, but then once the community forecast was revealed, everyone moved towards the average. This creates an illusion where it looks like the community is highly confident in that number, even though the reality is that most of the predictors just have no idea.
Many of my best questions in the tournament were ones where I was pretty far off the community, thought about hedging towards the average, but ultimately decided that my reasoning was actually good and the community was probably just herding around a number without good reason. The key thing was being able to distinguish between “I’m very different from the community, and also pretty clueless on this question, so I should probably trust them more”, and “I’m very different from the community but I think I know what I’m doing”. Winning the tournament comes down to maximizing the latter without being wrong too many times.
They really should put a prize pool on these tournaments. These “breaking news” style tournaments which have lots of short-timeline questions covering a wide range of topics are very valuable to the forecasting community and should be prioritized more.
IMO one of, if not the single biggest problem in the forecasting community is that too many people seem to be spending too much time predicting on very long term questions (spanning many years to many decades), which I’m not remotely convinced it’s even possible to make useful predictions on, and which are certainly impossible to use to develop good forecasting calibration/skills. You can’t get calibrated if none of your predictions ever resolve! I’m very dubious of people who make these decade-scale predictions with no prior track record of making successful short term predictions (which describes about 95% of the AI doom / not doom blog posters).
Prize pools, even relatively small ones, will attract more predictors, improve the quality of the community prediction, and properly emphasize short term forecasting over forecasting on questions where forecasters may never know the resolution.
As an addendum to this, tournaments should include one-off prizes for particularly useful comments (or some other reward scheme for comments). Tournament prizes are good because you get more people engaged in the tournament, which means more different perspectives on each question. On the other hand, prizes (and tournament scoring in general) also discourage useful commenting. Useful comments that meaningfully sway the forecast in what turns out to be the correct direction should be rewarded, perhaps comparably to winning the tournament. Ultimately the goal of the site is to get good predictions out to the world as far in advance as possible, and it’s hard to achieve this if the incentive is to hide the best information.