Yali.ai

Role: Machine Learning Engineer

Period: June 2022 → Present

Lead Developer and Maintainer, Proprietary Retrieval-Augmented Generative Engine:

Spearheaded the transition of the core system from an extractive BERT-style model to a fine-tuned LLaMA-based large language model (LLM), enabling abstractive, high-quality answers generated from client data.
Implemented custom relevance-checking systems that improved retrieval-pipeline performance by 40%, as measured by Mean Reciprocal Rank (MRR) and normalized Discounted Cumulative Gain (nDCG), and raised answer-generation quality.
Engineered an ultra-low-latency pipeline that answers client queries in under 100 milliseconds.
Continuously improved retrieval efficiency and answer-generation quality by integrating recent research findings into the system.

Lead Developer and Maintainer, Proprietary Large Language Model (LLM) API Deployment System:

Pioneered an LLM API deployment system that launches any supported large language model as an API in under 10 seconds. The system supports multiple inference engines and quantization methods, including vLLM, CTranslate2, GPTQ, and AWQ.
Engineered a dynamic optimization framework that improved memory efficiency, throughput, and latency. Implemented batched processing and in-flight batching, and developed routing logic that directs each request to a latency-optimized or throughput-optimized engine based on query type.
Built LLM-based classification models with integrated confidence scoring, producing high-quality classifications 2x faster than standard generative responses.

Lead Developer and Maintainer, Proprietary Speech Transcript Enrichment API:

Led development of a speech transcript enrichment API combining conversational summarization, sentiment analysis, and custom hate-speech classification models, all built from scratch.
Designed and implemented compact, low-latency conversational sentiment models distilled from 70B-parameter LLMs, delivering high-accuracy sentiment assessments in under 50 ms.
Applied prompt-tuning to proprietary LLMs, substantially improving the quality and contextual accuracy of their conversational summaries.
Integrated the API with proprietary real-time speech processing systems, enabling it to handle live data streams.

Independently conceived and built these APIs from direct CEO input in a fast-paced startup environment, working with a high degree of self-direction and iterating rapidly as business needs evolved.