
Benjamin Merkel

BM-TNG


Recent Activity



updated 2 models about 1 month ago

tngtech/DeepSeek-TNG-R1T2-Chimera

Text Generation • 685B • Updated Nov 4 • 1.17k • 259

tngtech/DeepSeek-R1T-Chimera

Text Generation • 685B • Updated Nov 4 • 648 • 265

commented on Efficient Request Queueing – Optimizing LLM Performance about 2 months ago

Yes, disaggregated prefill has improved by now, although setting it up correctly still takes some custom work.
The original challenge of multi-user serving with heterogeneous load patterns and priorities remains, though.

commented on Prefill and Decode for Concurrent Requests - Optimizing LLM Performance about 2 months ago

Prefill and decode can be processed in the same batch because the per-token operations are identical. During prefill, we simply throw away the logits from all but the very last prompt token.
Chunked prefill does nothing special to make this combination possible. It just makes more sense with chunked prefill, because otherwise a decode scheduled in the same batch as a very long prefill will take ages.
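
To make the point above concrete, here is a minimal toy sketch of a scheduler that mixes decode and prefill tokens in one batch under a per-step token budget. It is not taken from any real serving framework: `Request`, `build_batch`, `step`, and `token_budget` are hypothetical names invented for illustration, and the "model" is reduced to counting tokens. The `sample` flag marks the only positions whose logits would be kept; all other prompt positions are processed and their logits discarded.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request state; not a real framework type."""
    name: str
    prompt_len: int      # number of prompt tokens to prefill
    prefilled: int = 0   # prompt tokens already processed
    generated: int = 0   # output tokens produced so far

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def build_batch(requests, token_budget):
    """Return (request, n_tokens, sample) entries for one forward pass."""
    batch = []
    remaining = token_budget
    # Decode requests go first: each contributes exactly one token.
    for r in requests:
        if not r.in_prefill and remaining > 0:
            batch.append((r, 1, True))
            remaining -= 1
    # The leftover budget is filled with (chunks of) prompt tokens.
    for r in requests:
        if r.in_prefill and remaining > 0:
            n = min(remaining, r.prompt_len - r.prefilled)
            # Logits are only kept if this chunk contains the last prompt token.
            finishes_prompt = r.prefilled + n == r.prompt_len
            batch.append((r, n, finishes_prompt))
            remaining -= n
    return batch


def step(requests, token_budget):
    """Run one simulated step and return the number of tokens in the batch."""
    batch = build_batch(requests, token_budget)
    for r, n, sample in batch:
        if r.in_prefill:
            r.prefilled += n
        if sample:
            r.generated += 1
    return sum(n for _, n, _ in batch)


# One request that is already decoding, one with a long prompt still to prefill.
chat = Request("chat", prompt_len=4, prefilled=4)
doc = Request("doc", prompt_len=1000)

# With a budget of 256 tokens the long prompt is split into chunks,
# so every step the decode request shares stays small.
sizes = [step([chat, doc], token_budget=256) for _ in range(4)]
print(sizes)  # [256, 256, 256, 236]

# With an effectively unlimited budget the whole prompt lands in one batch,
# and the decode token has to wait for a 1001-token step.
chat2 = Request("chat", prompt_len=4, prefilled=4)
doc2 = Request("doc", prompt_len=1000)
print(step([chat2, doc2], token_budget=10**6))  # 1001
```

In both runs the decode token and the prompt tokens travel through the same batch in the same way. The budget only changes how large each step is, which is why the decode request stops stalling once the long prefill is chunked.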

New activity in tngtech/DeepSeek-TNG-R1T2-Chimera 5 months ago