I also have a scratch-my-own-itch project[1] that leverages an LLM as a core part of its workload. But it's so niche I could never justify opening it up to general use. (I haven't even deployed it to the web, because it's easier to just run it locally when I'm the only user.)

But it got me interested in a topic I have been calling "token economization." I'm sure there's a more common term for it, but I'm a newb to this tech. Basically, how to drive down the "run rate" of tokens used per request.

Have you taken a stab at anything in this vein? Like prompt optimization, and so on? Or are you just letting 'er rip and managing costs by reducing request volume? (Now that I've typed this comment out, I realize there is so much I don't know about the basics of commercial LLM billing and so on.)

[1] https://github.com/mattdeboard/itzuli-stanza-mcp

edit:

I asked Claude to educate me about the concepts I'm nibbling at in this comment. After some back-and-forth about how to fetch this link (??), it spit out a useful answer: https://claude.ai/share/0359f6a1-1e4f-4ff9-968a-6677ed3e4d14