Thread theme
I don't like how OpenAI keeps closing off more and more of what they're doing behind the scenes, but frick me I guess. 4o was less a step forward than a step sideways, and this o1 bullshit is in the same vein.
They're supposed to be running crazy transformers with apps (consumers) using them for CoT, but who fricking cares about that shit anyway, number goes up slop about PhDs and bro it scored like 85 on this one test bro just one more prompt bro just 900 more crosschecks bro.
If I read one more fricking r-slur post about strawberries i will unironically forcefeed them 3000 strawberries
Jump in the discussion.
No email address required.
!codecels, have any of you obtained access or tested it in agentic (i.e. LangChain) programs?
Don't you have to work for them to have access to it? They keep it locked up very tightly.
I was referring to API access for testing at temperature 0. I'm not talking about trying to guess the internal prompting OpenAI uses.
I have access in normal ChatGPT, unless this is different:
it's o1-preview; 4o is the "Her" one. Impressive. Have you tested it against benchmarks yet?
nah I just wanted to brag I had it
My corp has access too, but we haven't been able to do benchmark testing yet. We still use 4-turbo over 4o, plus a mix of other models.
There are LLM benchmarks? Wtf can that even mean? Rs in strawberries?
...yes? Are you r-slurred? You can also create custom benchmarks for your own use, to keep track of the random tweaks they (LLM providers) make.
As to the latter: many brainlets, mbneurodivergents, and turbospergs online complain that LLMs can't count the number of Rs in the word "strawberry".
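The ground truth for that check is trivial, no LLM needed, which is what makes it such a good gotcha. A minimal sketch of the check plus a tiny "custom benchmark" in the same spirit (the word list is made up):

```python
# ground truth: just count the letters directly
word = "strawberry"
print(word.count("r"))  # 3

# tiny custom benchmark in the same spirit: fixed inputs, expected answers
cases = {"strawberry": 3, "raspberry": 3, "blueberry": 2}
for w, want in cases.items():
    assert w.count("r") == want
```

Swap the `w.count("r")` side for a call into whatever model you're tracking and you have a letter-counting benchmark.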
... umm, of course I'm r-slurred?
Explain all your egghead shit you're saying in a way that doesn't piss me off.
A benchmark is like figuring out an LLM's 0-60, quarter mile, RPM, and more. Simple people care about the first two; true chads want to know how close they can get to overheating the engine. Stock cars give stock outputs, but you can hook up your own shit to really figure out the specs.
As to the Rs in strawberries: an LLM is a very strong parrot. Imagine asking a parrot how many Rs are in the word "strawberry". It knows you've said numbers such as 2, 3 and 4 near that phrase in the past, so it picks one of those numbers because you wanted a number, but that number may not be correct. For example, if you repeatedly said "I have five fingers and I like to count the letter R in strawberry" to a parrot, the parrot will probably tell you there are 5 Rs in strawberry.
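The parrot bit fits in a few lines of Python. Toy sketch, the "memory" list is made up, a real model is doing this over token probabilities rather than a literal tally:

```python
from collections import Counter

# the parrot's "memory": numbers it has heard near the phrase before,
# skewed by someone always saying "five fingers" in the same breath
heard = ["2", "3", "4", "3", "2", "5", "5", "5"]

# the parrot just repeats the most common thing it heard,
# whether or not it's the right count
answer = Counter(heard).most_common(1)[0][0]
print(answer)  # "5" - confidently wrong
```

It never counted anything; it reproduced the most familiar answer.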
Now tell me what temperature is. Is that like the CFG on SD? How do I set up llama so that when I ask how many Rs are in strawberry, it starts talking to me like it has a fever?
Do you know cars? Temp is like asking what happens if you frick with a setting in the ECU. It's related to probability curves and is complicated.
Tl;dr: if the model has seen lots of possible numbers and temp is set to 0.7, it can technically choose from many different options. Setting it to 0 kills the curve and forces it to always take the most likely one.
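In code terms, temperature just rescales the probability curve before the model picks a token. Minimal sketch with made-up scores; real stacks do this over the whole vocabulary:

```python
import math

def token_probs(logits, temperature):
    """Softmax with temperature; temperature 0 collapses to a single pick."""
    if temperature == 0:
        # temp 0: kill the curve, always take the single most likely token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for answers "3", "2", "4"
print(token_probs(logits, 0))    # [1.0, 0.0, 0.0] - always "3"
print(token_probs(logits, 0.7))  # probability spread over all options
```

Higher temperature flattens the curve (more "fever"); 0 makes the output deterministic, which is what you want for benchmarking.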
https://medium.com/@albert_88839/large-language-model-settings-temperature-top-p-and-max-tokens-1a0b54dcb25e
No clue what you mean by "talking like it has a fever".
Going back to the car analogy: you set it to 0 so you know it's the LLM provider fricking around. Your car will autoshift for a number of reasons, but setting temp to 0 is like hardcoding "at X RPM, go up one gear". You benchmark it to make sure that happens every time. If all of a sudden the gear isn't going up at X RPM anymore, you know the LLM provider fricked with the model again.
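The "make sure the gear change still happens" check is just a fixed prompt set with expected answers. Hypothetical sketch where `ask` stands in for your temp-0 call into whatever model/API you actually use:

```python
# hypothetical regression check: at temperature 0, answers should never drift
EXPECTED = {"How many Rs are in 'strawberry'?": "3"}

def check(ask):
    """`ask` is whatever function sends a temp-0 prompt to your model."""
    for prompt, want in EXPECTED.items():
        got = ask(prompt)
        if got != want:
            raise RuntimeError(
                f"model drifted: {prompt!r} -> {got!r}, wanted {want!r}"
            )

# stand-in deterministic "model" for demo purposes
check(lambda prompt: "3")  # passes silently
```

Run it on a schedule; the first `RuntimeError` tells you the provider fricked with the model.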