> I think it's as close to the perfect benchmark as you can get.
Depends what you are benchmarking for... If you are benchmarking the ability of the solution to solve LeetCode challenges, that is different from the ability of GPT4 to help everyday programmers knock out business logic or diagnose bugs.
My experience of GPT4 is that it's significantly better at the latter than GPT3.5.
Additionally, the real test for me is "Can an average programmer using GPT4 as a tool solve Advent of Code faster than an equally-skilled programmer without an LLM?".
I have done the same thing. “I am seeing the following behavior, but I expected <X>. What am I missing?” That line has often quickly solved some bugs that would have otherwise taken some time to debug. Very handy!
Can you share some examples of this? I haven’t had much luck with ChatGPT correctly identifying issues because (in my case, at least) they stem from other parts of a large codebase, and (last time I checked) I couldn’t paste more than a few kilobytes of code into ChatGPT.
One example is bugs caused by precondition violations, which ChatGPT can’t diagnose without also being given the code for all of the incoming call-sites, which means you end up solving the problem yourself before you’ve even explained the issue to ChatGPT - so (to me, at least) my use of ChatGPT is more akin to rubber-duck-debugging[1] than anything else.
> Can an average programmer using GPT4 as a tool solve Advent of Code faster than an equally-skilled programmer without an LLM?
I mean, yes, obviously? Any tool, even if it had near-zero marginal utility must necessarily improve performance, as you can simply elect not to use it.
The disagreement is not in the direction, it's in the magnitude. Therefore the real test is: "Can an average programmer solve AoC faster with GPT4 and without syntax coloring or with syntax coloring and without GPT4?", or "GPT4 in C versus no-GPT4 in Python", or "GPT4 on a crappy laptop vs no-GPT4 on a high-end workstation" and so on.
> Any tool, even if it had near-zero marginal utility must necessarily improve performance, as you can simply elect not to use it.
That might be true for any tool with consistent, predictable output that you thoroughly understand. Someone could easily lose a few hours dickering around with ChatGPT only to realize they’re better off just starting from scratch.
> Any tool, even if it had near-zero marginal utility must necessarily improve performance, as you can simply elect not to use it.
No opinion on the specific tradeoff here, but in general it's not obvious to me that your statement is true. Using tools involves opportunity cost, so sometimes "not electing to use it" can on average be a net win, no? There is a cognitive and time load to using it at all, right?
A simpler test would be keeping the development environment the same and then adding GPT4 to see if there is a statistically significant and meaningful speed increase (of a decent magnitude).
I'm not looking for a 1% speed improvement, I'm looking for a >50% speed improvement. Maybe I should have stated 'significant' speed improvement in the initial post.
It's not meant to be a scientifically rigorous test, but the idea is this.
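To make "statistically significant and meaningful" concrete, here is a toy sketch of how such a comparison could be scored. The timing numbers are entirely made up for illustration; only the arithmetic is real:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical solve times in minutes; purely invented for illustration.
with_gpt4 = [22, 31, 18, 27, 25, 20, 29, 24]
without = [41, 38, 52, 35, 47, 44, 39, 50]

# Relative speed improvement of the GPT4 group.
speedup = 1 - mean(with_gpt4) / mean(without)

# Welch's t-statistic: how many standard errors apart the two means are.
se = sqrt(stdev(with_gpt4) ** 2 / len(with_gpt4)
          + stdev(without) ** 2 / len(without))
t = (mean(without) - mean(with_gpt4)) / se

print(f"speedup: {speedup:.0%}, t = {t:.1f}")  # → speedup: 43%, t = 7.1
```

A t-statistic that large would comfortably clear any conventional significance threshold; in a real study you would also compute a proper p-value and, more importantly, check whether the effect size clears the >50% bar mentioned above.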
It seems like you're accepting that (at least in its current state, barring unknowable future discoveries) this is a technology that complements developers, making them more productive.
If this is the case, then the question is how relevant this improvement is in quantitative terms. The cumulative improvements in developer productivity since the days of punching cards have easily been several hundred percent.
A thought experiment to figure out how relatively impactful this is would be to compare it to other technologies and see which gives the greatest boost.
My prior belief: somewhere around Intellisense level of useful, but not significantly more than that.
Yes, I think the current state of technology (just starting) is a technology that complements developers.
Although my personal belief is this is more likely going to be an improvement similar to Assembly -> C (even if the LLM component doesn't improve, but assuming that the tooling does improve).
Personally I think there is going to be a new 'higher, higher level' of programming paradigms that are about to be invented that are supported by the ability of LLM's to write code - still augmenting humans, but making them many times more productive.
I think the opposite - LLMs will cause programming language development to become stagnant, because the language that complements an LLM the best is the language that has the most code out there in the wild.
So the "LLM language" already exists - it's JavaScript, Java, C#, etc.
This reminds me of a study where they compared the time taken to solve a problem comparable to a two-star AoC problem in difficulty across different dev environments, such as languages and IDEs.
There are a lot of anecdotes about how "it's the programmer, not the language" or "any language can be productive", etc...
The actual results were that many popular languages are more than an order of magnitude slower to achieve time-to-correct-solution than others. From memory, the fastest was F# with about 20 minutes, then Python and C# with about 40 minutes for both, and then C/C++/Java were hours to even days!
> ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.
Well, it's going to need to rewrite functions to add debug output due to not having edit capabilities, but I tried this and it absolutely added debug info which it then used to debug issues:
I don't know quite what's happening but I feel like people constantly say it can't do something and the very first thing I try (just asking it to do the thing) usually works.
I gave it the hands in the problem statement and the expected result, and the explanation as to why (copy pasted). It ran the code, looked at the debug output, identified the problem and rewrote the function. I'm not saying it immediately solved the problem, but it easily added debug information, ran the code, looked at the output and interpreted it.
Same. I have pre-defined “comments”, “error handling”, “unit tests” and more, and I get it all. Plus if I target an area and ask for a revision, I get just that... Tech people both misunderstand “AI” and think they have mastered it by default.
taking HN comments that claim ChatGPT can't do a thing, pasting them into ChatGPT, and then linking the reply where it does the thing, usually doesn't get me any points, in case someone wants to build a bot that does that for points.
Of the few I attempted this year I actually used ChatGPT to help me decipher just what the hell the waffling, rambling, windy, unclear, and ultimately superfluous challenge requirements were.
I didn’t have the solutions generated by ChatGPT, to be clear; I used a prompt along the lines of “take this text and extract its requirements and generate bullet points, also make the example inputs and outputs clear”.
I did the same for the previous year too, when ChatGPT hadn’t been out for long.
I find that a lot more enjoyable and less tedious.
While not all people appear to appreciate your comment, I can relate to it. Not everyone enjoys this kind of story and world building; some would like to get to the point already.
People (including me) can sometimes solve the tasks quicker than others can read them. Just read the sentence above the example, then skim the explanation below and go for it. It's a gamble sometimes, but it's entirely feasible to skip all the story (it often does contain hints, though).
What strikes me about ChatGPT is the blatantly wrong answers it can give. I asked ChatGPT to solve an augmented matrix using Gaussian elimination, and it failed at this straightforward task spectacularly.
Perfect example of "confident sounding hallucinations": I was just googling (a moment ago) why olive oil causes a burning sensation. It turns out there's a substance called oleocanthal[1], and there are receptors for it mainly in the throat. But while googling, I saw on Quora (which made it into the previews on Google) an "Assistant - Bot" response that is completely wrong: "Drinking olive oil can cause a burning sensation in the back of your throat due to its high fat content[...]"[2]
It wouldn't be notable if someone had specifically asked ChatGPT, knowing its limitations, but using it to automatically populate Quora and Google is pretty bad. People are using LLMs to fill the web with BS.
One is dumb as a brick, the other isn't. If you don't specify which, then your comment should be dismissed out of hand.
Also, it's a well-known limitation of all current LLMs that they're terrible at basic algebra. Instead of trying to replace BLAS or Maple with it, ask it to write the Python code or produce the Mathematica expression.
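For example, instead of asking the model to row-reduce the matrix itself, ask it to emit code like the following (a plain-Python sketch of Gauss-Jordan elimination with partial pivoting; the function name and example system are mine):

```python
def solve_augmented(rows):
    """Solve a linear system given as an augmented matrix [A | b]
    using Gauss-Jordan elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    n = len(m)
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [row[-1] for row in m]

# 2x + y = 5 and x + 3y = 10 have the solution x = 1, y = 3.
print(solve_augmented([[2, 1, 5], [1, 3, 10]]))  # → [1.0, 3.0]
```

The point is that the generated code does the arithmetic deterministically, sidestepping the LLM's weakness at carrying out row operations token by token.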
This creator has some excellent videos of chat GPT attempting advent of code.
He uses it in a more generous format where he is often giving it multiple attempts and trying to coax the correct answer out of it. He definitely has more success with it than the article, but it is hard to tell how much of that success is due generous assistance and prompting, and so it is hard to know how much the model has actually improved year over year.
I won't spoil which one it eventually gets choked up on, but it does. But that's beside the point. If I had to sit down and do the Advent of Code challenges with and without ChatGPT's assistance, I know which one I'd choose. Programming's been forever changed, for better or worse, and pretending like it hasn't does no one any favors. I'm not doing LLM-unassisted programming anymore, because it's a waste of my, and everyone else's, time, and the industry needs to move accordingly.
I have an observation not related to ChatGPT but to the debugging skills mentioned in the article: indeed, I've always felt that most teaching is done on perfect working code. I've never seen an exercise for developing debugging skills. For example: "This Dijkstra implementation is finding the wrong path." and work with students to pinpoint the off-by-one error. I think it would reveal so much more about the implementation details than just explaining how it works. It could be its own topic to explore race conditions, edge case studies and so on.
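A minimal sketch of what such an exercise could look like (hypothetical, not from any real course): hand students a heap-based Dijkstra in which the instructor has planted a subtle bug, such as pushing the edge weight instead of the accumulated distance, and have them work out why paths come out wrong. The reference version below is correct, with the likely bug site commented:

```python
import heapq

def dijkstra(graph, src):
    """Shortest distances from src in a weighted digraph given as
    {node: [(neighbor, weight), ...]}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry: a shorter path was already found
        for nbr, w in graph.get(node, []):
            nd = d + w  # classic planted bug: using just `w` here
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(g, "a"))  # → {'a': 0, 'b': 1, 'c': 3}
```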
Debugging is by far the most difficult thing when working with AI algorithms, because the output is rarely deterministic and that makes it hard to tell the difference between actual bugs and a bad selection of hyperparameters. You're also typically working with huge arrays of numbers, so unless you have some way to visualise what's happening it's also difficult to pinpoint where the algorithm is doing something it shouldn't be.
My main criticism of learning about algorithms using a framework is that they handhold so much with optimisation techniques and accessible APIs that you can know very little about what's actually happening and build something that works.
When writing AI algorithms, just knowing how to implement them in theory is rarely ever enough. The difficulty is always in the debugging and optimisation techniques.
Are these problems unique enough that they couldn't have circulated before the AoC where they appear now?
I notice that for "novel" code that CoPilot hasn't seen before, it's mostly useless, but when writing a hobby project which is a path tracer (of which there are probably 1000 implementations on GitHub) it's excellent. Which isn't surprising. It has seen the exact same function I'm writing, written 100 times in every language imaginable. There are books on the topic etc.
I’m curious…if you’re writing a path tracer for fun, does using Copilot not take the fun out of it? If it’s already been done a thousand times, it feels like the point has got to be to write your version, otherwise why not just copy-paste or pull in a library? If you use Copilot are you not giving up the joy of building it with your own hands? What’s left at that point?
Good question. There is quite a lot of boilerplate in graphics, which is basically copy-paste/textbook lookup, and ChatGPT is fantastic for this sort of thing. I wouldn't be able to deduce this; it would be a copy-paste from github/stackoverflow/physics book/whatever. ChatGPT, though, is like a magic copy-paste that takes the formula from github or the pbrt book and pastes it with my variable names. So you don't paste and then update the names.
// convert color from linear to sRGB
let srgb = [some formula]
// calculate wavelength dependent index of refraction
let nw = [something]
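For reference, the first of those two comments corresponds to the standard piecewise sRGB transfer function (IEC 61966-2-1), which is exactly the kind of look-it-up formula being described:

```python
def linear_to_srgb(c):
    """Convert a linear-light channel value in [0, 1] to sRGB using the
    standard piecewise transfer function."""
    if c <= 0.0031308:
        return 12.92 * c  # linear segment near black
    return 1.055 * c ** (1 / 2.4) - 0.055  # gamma segment

print(linear_to_srgb(0.5))  # ≈ 0.7354
```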
But yes, obviously somewhere between this and just saying "Create all the code I need" all the fun would disappear too. For the most part, that wouldn't work either. It would shit out 200 lines of code that doesn't work, and now instead of taking the boring job of googling a formula from an optics reference and replacing the variable names, it's taking the role of shitty developer and making me the code reviewer. But see, that's my day job and it's not what anyone wants.
I'm not sure a nail gun is the best analogy; it's a very simple tool that does one thing exactly as "instructed," and can't do any of the "thinking" for you.
Fair enough. Let's go with the analogy of a compiler, allowing you to write in python instead of assembly. Certainly this is doing a great deal of "thinking" for you. But hasn't it unlocked the more fun aspect of what you were doing? Not that this is true for everyone; some folks do still like to write assembly by hand, for fun or profit. But a compiler was still a remarkable advancement and worth celebrating.
Kind of. The precise, exact problem might be new, but it's similar enough to things that are already out there for many people to have helper functions at the ready and solve them with absurd speed.
Seeing the other comments, it still leaves me with the question of whether this is a limit of scale or a limit of technology. How much more can LLMs scale up over the next decade?
This in itself proves nothing. Consider how much resources have been poured in making rockets better over the last few decades and yet we still can barely make it to Mars.
It is also quite possible that there are fundamental limits to scaling LLMs that we have not discovered yet. You won't find a way to let airplanes fly to the moon no matter how many resources you pour into it. To make it there, you need better methods (ie rockets in this case) that look nothing like the first few steps to get off the ground.
LLMs fairly quickly got to consuming a significant portion of all available data. I can't see them improving significantly just by throwing more data at the problem.
What would be really interesting would be to repeatedly attempt to get ChatGPT to solve the problems.
By that I mean try to solve the Day 1 problem at the point of release, then try to solve it in a fresh ChatGPT session the next day, and then get it to solve it again the day after that, and so for the next couple of months.
Do that for each day that it runs and then see what patterns emerge. Part of me expects it to get better at solving each problem as the rest of us write our solutions and then make them available on the web in some form for it to harvest - but it would be interesting to see if that is the case.
No, there's no point: the model doesn't change day by day.
Kind of like saying "let's watch the same youtube video every day and see if it changes" - it won't, that's not how youtube works, there is no point trying.
Most people have that suspicion, myself included. But the API models stay verifiably the same, and OpenAI would have little reason to lie about not updating ChatGPT.
OTOH, I can easily imagine the helper/censorship models and chat’s system prompt being updated. The system prompt doesn’t change capabilities much, though, and the censor model doesn’t change the output, just cuts it off when it detects copyright or other violations (the case of chat not being able to recite the Dune poem, DALL-E not being able to produce certain images, etc.)
It would be interesting to compare the solutions from day 1 to the solutions a year from now (maybe less, but a year would give time for most people that cared to publish their solution and for them to be picked up in re-training).
Of course, an LLM will use the solutions that are in its training set. I asked an LLM to explain a pretty generic solution to one of last year's AoC questions, and even with no (or very little) reference in the code to the fact that it was an AoC solution, the LLM used terminology from the question.
It wasn't ChatGPT; I think it was Claude.ai. ChatGPT gave an explanation you would expect from just seeing the code, without the context that it was related to AoC.
Seems pretty clear most of the fantastic results from before were due to overfitting. Like every other case. It's amazing to me how these models can be caught red-handed being overfitted again and again and again, and people don't get the memo.
It’s hard to blame people when OpenAI’s whole shtick is that they’re mere years (at most) away from summoning a god machine that will usher in a techno-utopia, or destroy all of humanity. And to make matters worse, they have selected for this belief when hiring. So it’s pervasive throughout their work.
The data is out there to objectively assess these tools, but it’s not good for business.
What we need is standards bodies creating tests in private and running distributed, blind tests using only publicly accessible APIs or open-source models. There's no reason the ACM or a similar body couldn't do this for computer science evals, for example.
When that "AI-generated Angry Birds clone" was doing numbers, didn't it turn out that most of the code came straight from tutorials for writing Angry Birds clones?
Yep, because that's all AI is; a parrot. For well-defined problems it works great... BECAUSE THEIR WELL DEFINED.
Generative AI is only as good as the dataset that you give it, so for problems that exist in a heavily contrived and parameterized space (like leetcode-style problems) it works really well. But give it a novel problem with intertwining libraries and external dependencies along with custom type and structure definitions, it's going to fall flat.
Humans do something AI can't, and that's draw from experience to apply a solution on a novel problem. This is why I'm not terribly worried about AI coming for engineer jobs anywhere in the near future. I use ChatGPT all the time to write me small functions, generate regular expressions, etc. Basically all the drudgery.
I'd argue we've hit peak AI at this point because, from here on, all datasets are going to be colored by AI-generated results. Generative AI is now on a trajectory where it'll simply regress to the mean of knowledge.
How can you spell such a mistake in ALL CAPS and not see it? ;)
I don't know about the rest of your comment, albeit I consider myself more in the non-Chomsky camp. This emergence thing seems a little more than an elaborate hoax to me.
Well, maybe they've just trained GPT-4 to wiggle my balls, when I ask it to analyse a poem that I wrote 15 years ago.
The local llm scene, regardless of this debate, is nevertheless the hottest topic IMHO atm.
Several finetunes have repeatedly blown my mind even in the past 2 weeks. On a Raspberry Pi.
AI can interface: Natural Language Processing. This is the first time in history humanity has had such knowledge and technology.
You make it sound as if (software) engineer jobs constantly involve cutting-edge innovation, and perhaps as if all engineers are capable of it. That doesn't match my experience. Most of it is a more or less smart combination of standard concepts: something GPTs are already pretty good at, and I'm sure they will improve further.
But we are, at least I am, almost always integrating or interacting with new libraries, APIs, new features, new behaviours. I might not be personally writing that code, but I definitely need to be up to date and adapting all the time.
I'm not a developer, mind you, but devops. And while I reuse code regularly as well, when new projects come up I'll ramp up with said boilerplate. Though that's typically only required if it's something new.
If you think we've hit peak you're grossly underestimating the sheer volume of copyrighted books, manuscripts, screenplays, podcasts, movies, documents, history, and research papers that ChatGPT hasn't been trained on. There's a LOT more juice to squeeze still.
This is actually incorrect; there's not that much data left to train on. I remember reading an article about it, might have been one of Gwern's or something about Chinchilla scaling, but to produce an order-of-magnitude increase we need an order of magnitude more data, and there just isn't that amount available.
It might have, however a ton of gamedev-related code (like adding physics to an object, and having it explode on contact with something else) is either formulaic, or can be glued together from pieces of other code, a feat ChatGPT is definitely capable of. It doesn't require it to develop a sophisticated model of the problem.
And the above applies to a lot of prod code as well.
When GPT4's training data was only up to Sep 2021, I fed it a Rust solution to one of the 2022 AoC problems I had lying around on my PC. I asked it to annotate it with "formal doc comments".
Now keep in mind that it had never seen this problem, and it was written in a very terse "competition solving" style with zero comments, one-letter identifiers, and generic function names like "day1" and "parse".
It figured it all out. It worked out that it was for solving a maze -- even though the word "maze" was not used anywhere in the code!
It worked out that a constant "(-1,0)" in one isolated bit of code was "orienting the character west", even though to figure that out it had to trace the logic through 4 or 5 layers of indirection. It connected a single 'w' character in the parser to a vector somewhere else!
Etc...
PS: It wrote a more useful, more accurate, more coherent, and more grammatically correct comment than I had seen in any codebase I had worked on in something like two years.
> Problems start very easy on day 1 (sometimes as easy as just asking for a program that sums all numbers in the input) and progress towards more difficult ones, but they never get very hard: a CS graduate should be able to solve all problems, except maybe 1 or 2, in a couple of hours each.
I think this wildly overestimates the programming skills of the average CS graduate. My estimate of the fraction of CS graduates able to do that is closer to 1%.
Based on the leaderboard screenshot, only 7 people even solved up to day 12. The later days tend to be significantly harder than the early days. Based on top 100 leaderboard times, ~2x as hard for silver star.
Yeah, this… the conclusion I would have drawn is not that the gold star took between 2-5 on average, but rather that 85% of people who self-select as enjoying a programming challenge were not able to finish more than half the tasks. Not that that's bad — seems slightly above par for the leaderboards I was involved in — just that the conclusion is really misleading.
Nothing will change the fact that the people participating in this challenge are not a representative sample of computer science graduates, hence an evaluation of their performance will not really tell much about what kind of performance one should expect from the average computer science graduate.
I will say that one of the guys who tied to win the competition is a 2021 Computer Science grad from Clemson. He's not a brand new grad, but that's the closest sample that I have.
The claim was that all computer science graduates should be able to solve those problems in a few hours each; the counter-claim was that only one percent of them are capable of doing so. The existence of one computer science graduate capable of winning the contest does not really tell us anything, and I think nobody doubted that there are computer science graduates who are easily capable of solving such problems. But the question was, or is: are they the exception or the norm?
The problems are essentially an elaborate IQ test (follow instructions, pay attention to the actual input, do some googling to pick up the occasional polygon or graph algorithm/library). No deep knowledge is required.
You don't need a CS degree for that.
~5% of those who solved the first 2 problems solved all 49 of them. Given that it is a significant time commitment to spend several hours per day every day for 25 days straight, more people could have done it, given the time. These are not some elite self-selected geniuses. A quarter of those who solved the 1st problem haven't solved the 2nd one despite its simplicity.
https://adventofcode.com/2023/stats
This would also have been my opinion: computer science graduates should definitely be able to solve those problems in a few hours. The longest time to first solution this year was something like 15 minutes, an hour or two for 100 correct solutions, so they are certainly not really hard problems.
But personal opinion aside, nobody provided any real evidence that this is indeed the case, and that is all I wanted to point out in my comment.
There's still a huge selection effect. He could have been solving programming puzzles as a hobby for the past fifteen years for all we know, and his degree might have little to do with his AoC skills.
Everybody who participated seemed to enjoy solving the problems. Many of them have said they plan to solve the rest over the next couple of months, just not on the contest timeline.
Unlike Project Euler etc., where one really needs to be good at algorithms/math to make the most efficient code (else it will just run for days brute-forcing), most of Advent of Code can be solved with a terrible algo, as the input data is very small.
I think most CS grad students can solve Advent of Code. Some people probably don't finish it not because it is hard, but because they lose interest.
In my experience solving several years of Advent of Code, this is only true for part 1 of most days. A lot of part 2 solutions rely on heuristics to be solved in reasonable amounts of time.
There are some from 2023 that aren't. Day 12 part 2 at least for me had one input that ended up being 95% or more of all permutations and would have taken at least weeks.
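The usual way around that blow-up is to count arrangements with memoized recursion rather than enumerating them. A sketch of the idea for that day's pattern/run-length matching (the function and variable names are mine; the worked example is the small one from the puzzle statement):

```python
from functools import lru_cache

def count_arrangements(pattern, runs):
    """Count ways to replace '?' with '#'/'.' in pattern so that the
    lengths of the '#'-runs are exactly `runs`."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i >= len(pattern):                  # consumed the whole pattern:
            return 1 if j == len(runs) else 0  # valid iff all runs placed
        total = 0
        if pattern[i] in ".?":                 # treat this cell as operational
            total += go(i + 1, j)
        if j < len(runs):                      # try to start run j here
            n = runs[j]
            block = pattern[i:i + n]
            if (len(block) == n and "." not in block
                    and (i + n == len(pattern) or pattern[i + n] != "#")):
                total += go(i + n + 1, j + 1)  # run plus one separator cell
        return total
    return go(0, 0)

print(count_arrangements("???.###", (1, 1, 3)))  # → 1
```

Because the cache is keyed only on (position, run index), the state space stays tiny even when the naive enumeration would cover billions of permutations.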
> I don't pay for ChatGPT Plus, I only have a paid API key so I used instead a command line client, chatgpt-cli and manually ran the output programs.
But this is much different, and likely much worse, than paid ChatGPT. Code Interpreter does its own debugging-and-revising loop.
It's bizarre to me that this author wouldn't pay $20 one time to evaluate the higher quality product, the one most people would use if they cared about code quality, am I missing something?
Then somebody could maybe just copy and paste two or three of the failed problems into ChatGPT Plus and ask for some code solving the problems? Should only take a handful of minutes.
I stopped my ChatGPT Plus subscription and replaced it with pure GPT-4 API calls because the "product" built around the API just got dumber and dumber over time.
- ChatGPT pros:
Has some bells and whistles like code interpreter (which I can easily get via Open Interpreter).
Has plugins (although I found web browsing to be inferior compared to Poe/Perplexity).
- Pure GPT-4 API pros:
Is not dumbed down or forced to "forget" things or be lazy in coding.
I use the API either programmatically or through Poe.
Thank you. I can't find any clear comparison between ChatGPT Plus and GPT-4 (through the API) in that article though; it seems to be focused on how to improve results with GPT-3.5? I was expecting some benchmarks.
It's the same line of thinking that results in 90% of stuff on Indie Hackers and by solo devs being for themselves and other solo devs: tech tools that are extremely niche, for a market of people not willing to pay for them.
An extreme blindspot.
Every introverted dev should go outside and meet other people, check out the IRL marketplaces with one coffee for $10; that's the best investment they can make.
You can’t sign up for ChatGPT Plus anymore; they aren’t accepting new subscriptions. You just get added to a waitlist. At least that’s what happened a few months ago when I tried, and I haven’t heard anything from them yet.
That might be wrong; I just did a paid subscription for my dad at Xmas time, pretty sure you can still get paid subscriptions for ChatGPT Plus. We used his Google account for this.
AoC questions this year were _deliberately written_ to be confusing to LLMs; it's not failing because it's worse, it's failing because the questions were written to make it harder for models :]
Edit: apparently not, the author is just really good at coming up with ai adverse puzzles. When testing ChatGPT did much better on last year’s puzzles.
> I don't have a ChatGPT or Bard or whatever account, and I've never even used an LLM to write code or solve a puzzle, so I'm not sure what kinds of puzzles would be good or bad if that were my goal. Fortunately, it's not my goal - my goal is to help people become better programmers, not to create some kind of wacky LLM obstacle course. I'd rather have puzzles that are good for humans than puzzles that are both bad for humans and also somehow make the speed contest LLM-resistant.
> I did the same thing this year that I do every year: I picked 25 puzzle ideas that sounded interesting to me, wrote them up, and then calibrated them based on betatester feedback. If you found a given puzzle easier or harder than you expected, please remember that difficulty is subjective and writing puzzles is tricky.
He may not have consciously written them to confuse LLMs but even without using any he probably knows they get more confused on reasoning tasks stated in not very clear text. At least for a few problem descriptions I couldn't help feeling that the statement was a lot more complex than it could have been and not only because of the story woven around it. Of course it could have been there to confuse the humans :)
Some people have even speculated that the problems this year were deliberately formulated to foil ChatGPT, but Eric actually denied that this is the case.
Quoting the author of AoC:
Here are things LLMs didn't influence: The story. The puzzles. The inputs.
I did the same thing this year that I do every year: I picked 25 puzzle ideas that sounded interesting to me, wrote them up, and then calibrated them based on betatester feedback.
I think you should maybe be less confident in your statements if you actually don't really know. Highlighting _deliberately written_ when it was anything but.. well, that's confident bullshitting, LLM style, ironically.
I thought they were as well, because there were nuances which made them harder for LLMs. When I tried on 1 Dec, GPT-4 couldn't solve day 1 part 2. And I tripped over exactly the same thing; when I parsed it correctly and then explained it to the LLM, it did solve it. But nowhere near as fast as me with my bag of horrible hacky AoC libs...
It's interesting that it turns out this year was not written with GPT in mind.
The article actually links to a Reddit post[1] where the creator of Advent of Code says that 2023's puzzles weren't engineered to be more difficult for LLMs.
It will get incrementally better at being a word calculator and looking up text, but LLMs aren't going to magically gain critical thinking skills or impactful intelligence capabilities.
Then again, I think the opposite is (at this moment) equally unfalsifiable. Humans have on average 86 billion neurons, sperm whales have bigger brains and their average neuron count is 200 billion. African elephants have 257 billion neurons. There are some reasons to believe we are kind of special.
While I also don't think autoregressive LLMs will ever gain sentience, I also don't subscribe to the other side of the argument that it's all a dead end and true(tm) intelligence can never be achieved. Fact of the matter is, we don't know. We don't know why we are sentient, so how can we argue about what may or may not lead to sentience when we have such a limited understanding of the inner workings of current-gen LLMs? A little more humility about what we don't know is in order, from all sides.
My guess is we won't know when we've developed a sentient AI until we're well past the point that it's undeniable, and then in hindsight we might be able to identify where it began to transition from non-sentient to sentient.
Sentience isn't even well-defined and I'm not sure we can even point to any unique quality of any individual human as indicative of sentience. We certainly can't agree on what point an individual human becomes sentient, as evidenced by the abortion debate. At best we seem to have some statistical evidence based on what we've achieved as a species being superior to what oak trees have achieved, but at an individual level, I don't know how to prove to anyone that I'm not just a very advanced LLM.
There is another sense where sentience is just being used to mean "that which is human", and by definition, nothing aside from humans will ever qualify as that.
Sentience is a property of animal life. Electrons, software, rocks, etc. don't possess sentience, so the notion that AI, which is software, may be sentient is nonsense.
This is just extending the "sentience is that which is human" definition to animals.
What about animals makes them sentient? Until we can answer that, people are just going to be talking past each other. Even if you forget about AI, whether animals are sentient, and, if so, which animals are sentient, is a big argument in biology/ethics/law that's been going on for at least half a century. The UK recently passed a law that declares animals to be sentient, but not all invertebrates are considered sentient in that law. Their justification is that they don't contain a central nervous system, but what special property about a CNS confers sentience?
The "LLMs will attain intelligence/consciousness if we just throw enough neurons at the problem" crowd really seems to be engaging in cargo cult thinking to me. Not that I think it would be impossible to make a conscious or intelligent mind using simulated neural networks — I don't think that there's some kind of "special mystical property" about humans like a soul or something — it's just that the structure and training of LLMs is fundamentally wrong for producing intelligence, that's all. It's an easy, simple route that gives you startling results for the first 30% but it's a dead end in the long run for achieving what they want, yet they're insisting that if they just keep pantomiming hard enough real intelligence will come.
Humans weren't trained to imitate a pre-existing body of text. We were trained to survive & reproduce in a competitive, limited-resource environment. This pressured us to develop critical thinking. I see no reason why we should expect the same results from a radically different process.
Correct - there are no examples of evolution in non-living things. There are evolving ways of using things, but things themselves don't evolve, don't think, and don't learn. Uneducated people do assign such attributes to things: when people first saw tractors, planes, or robots, they thought they were the devil coming to grab them. AI doomers, and those who support the notion that these systems are somehow intelligent, do much the same. People used to think computer viruses were alive and had minds of their own - again, similar to thinking tractors are the devil. The same old lack of understanding of technology.
There have been rumours that the current ChatGPT loses money even at $20/month, and that to economise on running costs they've changed to a less capable model.
And if the modern LLMs have already mined everything there is to get out of pirated ebooks and the common crawl dataset - who knows how long it'll take for them to make the next big step forward?
Oh, I absolutely agree that the field isn't stationary. It's moving at a breakneck pace, in fact!
But synthetic training data has its problems. Oh, it's great if you want a limitless supply of templated high school math problems. In other fields, though? If GPT-whatever is a bit unclear on whether Magnus Hirschfeld was a Nazi or a victim of the Nazis, and you use it to generate synthetic training data - you can't expect the student model to know better than the teacher model does.
> you can't expect the student model to know better than the teacher model does.
Why not? People outgrow their teachers all the time, even in fields where there is no new external data, like maths. Often improved understanding comes just from reflecting on an issue from new perspectives, which synthetic data can provide. Perhaps a teacher/student model helps AI develop those new insights, just as it does for us.
If (when taken in aggregate) the training data says Gerald Ford was the 38th President of the United States, and the model disagrees, is that not a deficiency of the model?
Probably, and sometimes teachers are wrong. Also, I don't know who Gerald Ford is but I'll take your word that he's the 38th president of the United States as nobody is here to tell me otherwise ;). It's likely though that some otherwise intelligent and sentient people may disagree on who is the current president of the United States!
I don't believe anybody is suggesting that we exclusively use synthetic data, but rather that synthetic data can augment other types of training. The other thing to consider is that less sophisticated models can be prone to hallucinating nonsense, but the hallucinations are usually inconsistent, whereas truthful responses tend towards consistency in various directions: between each response, internal consistency, and consistency with reality.
It's conceivable that a more sophisticated model would be able to learn a sense of certainty based on the consistency of its training data, much as we do. If you consider your education, I'm guessing you probably had lots of people tell you incompatible nonsense over the years. In my case, I've had teachers give me a ton of inconsistent explanations of how electrons "know" which path to take in a circuit - probably one of my biggest questions since childhood. Only one explanation turned out to be internally consistent and demonstrable with experiments I've seen on YouTube. The result is that I now have, I think, a pretty decent understanding of how electric current forms a path within a circuit, or at least one that can be used to make valid predictions, despite being told a vast amount of inconsistent and wrong information over my life.
There's no way to know if they switched models unless they publicly admit it, because the nature of the tech makes it extremely hard to know why it does any specific thing at all, let alone to debug it to find out why it didn't do the thing one wanted.
ChatGPT discussion here has been totally dominated by this problem from the beginning, but usually it's people raising the bar on what you have to do to get good results from it.
It shows that people fundamentally do not understand the tools they are using.
Given a book of numbers, here are two tasks:
1) copy out the entire book, but replace every prime number with 7.
2) write down the list of prime numbers in the book.
Which one is easier?
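To make the asymmetry concrete, here's a toy version of both tasks (plain Python, nothing LLM-specific): task 1's output must faithfully reproduce every number in the book, while task 2's output only has to contain the primes.

```python
def is_prime(n: int) -> bool:
    # Trial division is plenty for a toy example.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

book = [4, 7, 10, 13, 20, 29, 30]

# Task 1: copy the whole book, replacing every prime with 7.
task1 = [7 if is_prime(n) else n for n in book]

# Task 2: just write down the list of primes.
task2 = [n for n in book if is_prime(n)]
```

Task 1's output is exactly as long as the input, and a single mistranscribed non-prime corrupts the copy; task 2 carries no such copying burden. That's the position an LLM is in when asked to echo back code with a few small fixes.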
LLMs have to generate tokens one at a time, and it's very difficult for them to reproduce a long stream of input tokens perfectly while changing only a select few.
Since the sampler is almost certainly randomising token selection to some degree (that's what temperature controls), you're also asking for deterministic output from a random process.
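For reference, a minimal sketch of what temperature does during sampling (a toy softmax sampler, not any real model's API): logits are divided by the temperature before sampling, so T = 0 degenerates to greedy, deterministic decoding, while larger T injects more randomness.

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float,
                 rng: random.Random) -> str:
    if temperature == 0:
        # Greedy decoding: always pick the highest-logit token.
        return max(logits, key=logits.get)
    # Scale logits by 1/T, then apply a numerically stable softmax.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(l - m) for tok, l in scaled.items()}
    # Sample one token proportionally to its softmax weight.
    r = rng.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # guard against floating-point rounding
```

The point being: at any temperature above zero, the exact same prompt can yield different tokens on different runs, which is exactly what you don't want when asking for a verbatim copy.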
TLDR: ask LLMs what is wrong with the code.
Ask for a diff.
Don’t ask for an LLM to refactor, bug fix or annotate code…
That’s extremely naive usage.
Back to my stupid analogy: “please copy out this book, but fix the numbers which are ‘wrong’”
I can hardly complain when I get terrible results can I?
If you don't understand that an LLM generates output token by token, and that randomizing the token output probabilities means you cannot expect an error-free copy of the input, then you've invested so little time in understanding the tool as to be farcical.
There's 'wow, these are complicated and I don't fully understand them'
...and there's, 'What this. It shiny. Not worky. Make some random change to prompt and pray to LLM gods'
I think you're being unfairly downvoted. People generally are terrible at using tools, especially tools that they perceive as "black boxes".
I've seen an IT professional type this into Google: "Why did my PC crash?"
I couldn't believe that after two decades of using web search technology, he still hadn't figured out how to extract value from a text index using specific and relevant key words.
Similarly, very few developers know what a database index does or whether they need one or not. My pet theory is that NoSQL databases became popular because many of them automatically index every column, making it feel like a magic black box instead of an evil black box.
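(As a concrete aside: in SQLite, for instance, you can watch the query plan flip from a full table scan to an index search the moment you add an index. A sketch, with made-up table and index names:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(1000)])

query = "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?"
# Fourth column of the plan row is the human-readable detail string.
before = conn.execute(query, ("user500@example.com",)).fetchone()[3]

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = conn.execute(query, ("user500@example.com",)).fetchone()[3]

print(before)  # a full table scan of users
print(after)   # a search using idx_users_email
```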
LLMs are not only black boxes, but they're soooo fundamentally different and new that a lot of people are really struggling to wrap their brains around it.
Just in this thread, today, there are people that are complaining about the direct equivalent of "my poorly thought out Google search didn't work, so Google is bad."
Here's a re-wording of our conversation without that; you decide how you want to take it.
me: I am sad because people are clearly using these tools without understanding them; here is a specific example and reason of why what they're trying to do doesn't work.
you: these tools cannot be understood.
me: not only is that literally false, it's obviously and self-evidently false, and I just gave you a specific example of how; I can't take anything you say as being in good faith when you believe that anything to do with AI is literally unfathomable, and you can't even be bothered responding to what I actually wrote.
Maybe, looking into it, you would find that it's not nearly as complicated or difficult as you imagine.
My impression of the 2023 AoC was that Wastl has spent considerable effort on making it less LLM-friendly this year. Some days, this seems to have been done by adding extra conditions and complications which make the task more difficult for LLMs to parse. Other tasks required studying the input data which is difficult to achieve with an unsupervised LLM. Finally, the first couple of days seemed a lot more difficult this year than previous years, possibly to deter chatgpt users from filling up the leaderboard right away (though this year December started with a weekend which could also be a contributing factor).
I also felt there were more problems than usual this year that could not easily be solved without looking at the input for special cases not alluded to in the problem descriptions. (As someone who has solved all 25 for the past 3 years).
An extreme example was this year's day 20 circuit-simulating problem, which was made far easier by having the given circuit split up into a few independent chunks that are only connected at the start + end. (I suspect it might be NP-complete without this feature)
It's a slightly different kind of problem solving to think "what makes this particular input easier than the general version of this problem", and one that I'd naively assume LLMs are less skilled at.
There have been quite a few of these in the past. Some of the 2018 problems (e.g. day 21 [1]) required quite a lot of reverse engineering of programs in a custom instruction set.
> considerable effort on making it less LLM-friendly this year
Wastl himself denied that this is the case[1]. This is a lie.
> Other tasks required studying the input data
That has always been the case for AoC; how is it different from other years?
We have a clear example of a set of tasks that state-of-the-art LLMs cannot perform. We are doing science, for once. Why do we need to get into full conspiracy mode?
I apologize. It was simply my subjective impression of the AoC this year, not a statement of objective fact. I didn't intend to insinuate that Wastl is a liar.