I've been putzing around with a self-hosted setup. I have a very basic React.js frontend, a minimal Flask app, and I'm using Ollama to serve Llama 3 8B. I'm running into a problem though: each query is handled as a one-shot prompt instead of as a chat.
has anyone else messed around with this stuff?
I haven't used these since there were suddenly jobs in it, but I think you have to pass all the context you want the model to consider in almost all cases (fine-tuning and the OpenAI Assistants API are exceptions). So keep a messages list with all prior messages and feed the whole conversation in each time. The demos they have seem to bear that out, doing a messages.append(message) each and every time.
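For what it's worth, a minimal sketch of that pattern against Ollama's HTTP API (the /api/chat endpoint, port, and response shape here are assumptions based on Ollama's defaults; the model name is whatever you pulled):

```python
import json
import urllib.request

# The server is stateless: the client keeps the full history and
# resends the whole conversation on every request.
OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default endpoint

def build_payload(history, user_text, model="llama3"):
    """Append the new user turn and build the request body."""
    history.append({"role": "user", "content": user_text})
    return {"model": model, "messages": history, "stream": False}

def chat(history, user_text):
    """Send the entire conversation so far; store the reply in history."""
    payload = build_payload(history, user_text)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())["message"]
    history.append(reply)  # assistant turn joins the context for next time
    return reply["content"]

# History grows turn by turn; nothing is remembered server-side.
history = []
payload = build_payload(history, "hello")
```

In the Flask app, history would live per-session (or be sent up from the React side) rather than in a module-level list.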
Huh, seems like a lot of chatter, but I guess bandwidth is cheap.
Yeah, cramming everything into the one call (the "context window" is the term of art) is one of the big downfalls of the current stuff, but I don't see how it gets fixed.
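The usual mitigation (not a fix) is to drop the oldest turns once the history outgrows the window. A rough sketch, where a character count stands in for a real token count (an assumption; real setups use the model's tokenizer):

```python
# Trim the oldest non-system turns so the conversation fits a budget.
# max_chars is a crude proxy for the model's context window.
def trim_history(messages, max_chars=8000, keep_system=True):
    """Drop the oldest non-system messages until the total fits."""
    if keep_system:
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
    else:
        system, rest = [], list(messages)

    def total(msgs):
        return sum(len(m["content"]) for m in msgs)

    while rest and total(system) + total(rest) > max_chars:
        rest.pop(0)  # oldest turn goes first

    return system + rest

# 1 system message plus 50 user turns of 100 chars each:
convo = [{"role": "system", "content": "be terse"}] + [
    {"role": "user", "content": "x" * 100} for _ in range(50)
]
trimmed = trim_history(convo, max_chars=1000)
```

Summarizing the dropped turns into a single message is the fancier version of the same idea, but it costs an extra model call.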