The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models.
Large language models (LLMs) have demonstrated remarkable generalization across diverse tasks, leading individuals to increasingly use them as personal assistants because of their emerging reasoning capabilities. Nevertheless, a notable obstacle arises when including numerical/temporal data in these prompts, such as data sourced from wearables or electronic health records. LLMs employ tokenizers at their input that break text down into smaller units. However, tokenizers are not designed to represent numerical [...]
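To illustrate the pitfall the abstract describes, the sketch below runs a standard subword tokenizer over a short serialized series of readings. The specific tokenizer (GPT-2 BPE via the Hugging Face transformers library) and the sample values are illustrative assumptions, not the authors' setup; the point is that consecutive numerical values can be split into uneven, context-dependent tokens.

```python
# A minimal sketch (assumed setup: Hugging Face transformers, GPT-2 BPE tokenizer;
# the sample readings are made up for illustration).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A short sequence of wearable-style readings serialized as plain text.
series = "heart rate: 72.5, 73.1, 101.4, 98.6"

# The BPE tokenizer splits the string into subword units; digits and decimal
# points are often broken apart, so values that are numerically close can end
# up with very different token sequences (the exact split depends on the tokenizer).
print(tokenizer.tokenize(series))
```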
Author(s): Spathis, Dimitris; Kawsar, Fahim
DOI: 10.1093/jamia/ocae090