I've been putzing around with a self-hosted setup. I have a very basic React.js frontend and a minimal Flask app, and I'm using Ollama to serve Llama 3 8B. I'm running into a problem though: each query is handled as a one-shot instead of as a chat.
Has anyone else messed around with this stuff?
Some nerd has it figured out; it's just got to do with how the model gets loaded, afaik.
Hugging Face probably has something you can just download.
Doing it mostly myself was kind of the point. I just got stuck on this, and it's hard to turn up search results that aren't irrelevant.
It's way harder than it seems. The way you have it running, the model gets reloaded with the prompt each time and the process ends after, so nothing carries over between queries.
To have a consistent chat, I think you'd have to store the current conversation in memory and have Llama read it back to "remember"; otherwise there won't be continuity. Something like the sketch below:
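A minimal version of that resend-the-history pattern, assuming Ollama's /api/chat endpoint on its default port (the model tag and names here are just illustrative):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint
history = []  # every turn so far; this IS the model's "memory"

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    # Resend the whole conversation -- the model only sees what's in this request.
    resp = requests.post(OLLAMA_URL, json={
        "model": "llama3",
        "messages": history,
        "stream": False,  # one JSON reply instead of a token stream
    })
    reply = resp.json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Sam."))
print(chat("What's my name?"))  # only works because the first turn was resent
```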
Yeah, I have a Flask server handling the actual calls and I've set up sessions, so it shouldn't be too much of a lift. It just feels dumb to resend the whole payload when it seems like I should be able to configure Ollama to store sessions for a time period.
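Fwiw, Ollama's keep_alive option only controls how long the model weights stay loaded in memory; afaik it doesn't store any conversation state, so resending the history each turn is the expected pattern. A rough sketch of the Flask side with a per-session in-memory store (the route name, model tag, and the store itself are all just illustrative):

```python
import uuid
import requests
from flask import Flask, request, session, jsonify

app = Flask(__name__)
app.secret_key = "change-me"  # required for Flask's signed session cookie

conversations = {}  # session id -> message list (in-memory; swap for a DB later)

@app.post("/chat")
def chat():
    # Hand each browser a stable id via the Flask session cookie.
    sid = session.setdefault("sid", str(uuid.uuid4()))
    history = conversations.setdefault(sid, [])
    history.append({"role": "user", "content": request.json["message"]})
    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "llama3",
        "messages": history,   # the full history goes out on every turn
        "stream": False,
        "keep_alive": "10m",   # keep the model loaded between requests
    })
    reply = resp.json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return jsonify({"reply": reply})
```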
Also, I have a MySQL DB up so I can persist chats between sessions, so I guess it should all work.
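Something like this is what I'm picturing for the persistence layer (the table name, columns, and connection details below are all placeholders):

```python
import mysql.connector  # pip install mysql-connector-python

# Assumed table:
#   CREATE TABLE messages (
#       id INT AUTO_INCREMENT PRIMARY KEY,
#       session_id VARCHAR(36),
#       role VARCHAR(16),
#       content TEXT
#   );
db = mysql.connector.connect(
    host="localhost", user="app", password="...", database="chat"
)

def save_message(session_id, role, content):
    cur = db.cursor()
    cur.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (%s, %s, %s)",
        (session_id, role, content),
    )
    db.commit()

def load_history(session_id):
    # Rebuild the message list Ollama expects, oldest turn first.
    cur = db.cursor()
    cur.execute(
        "SELECT role, content FROM messages WHERE session_id = %s ORDER BY id",
        (session_id,),
    )
    return [{"role": r, "content": c} for r, c in cur.fetchall()]
```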