Uni Internship May to Dec 2026 - Intelligent ETL Using Pre-trained Models/LLMs
Date: 2 Jan 2026
Location: SG
Company: Synapxe
Synapxe is the national HealthTech agency inspiring tomorrow’s health. The nexus of HealthTech, we connect people and systems to power a healthier Singapore. Together with partners, we create intelligent technological solutions to improve the health of millions of people every day, everywhere.
Are you someone who enjoys problem solving, has a creative and curious mind, and strives to create a better and healthier tomorrow? If you answered yes to all of these, check out our website to find out more about Internship@Synapxe.
Join Synapxe as an intern and see how you can contribute to powering a healthier Singapore. We aim to deliver the best experience for all interns, helping you grow rapidly and pave your future in the tech industry.
Project: Intelligent ETL for Annotating Healthcare Data Using Pre-trained Models/LLMs
MOH and other public healthcare institutions generate vast amounts of unstructured data daily, including clinical notes, diagnostic reports, patient records, and medical imaging metadata. This wealth of information holds tremendous potential for improving patient outcomes, advancing medical research, and optimising healthcare delivery. However, extracting meaningful insights from this data requires sophisticated annotation and processing capabilities. Traditional manual annotation of healthcare data is time-intensive, costly, and prone to inconsistencies. The emergence of pre-trained language models and large language models (LLMs) presents an opportunity to automate and enhance the annotation process whilst maintaining high accuracy standards.
By leveraging these advanced models within a robust data processing framework, healthcare organisations can unlock the value of their data assets more efficiently. This project focuses on developing an automated healthcare data annotation system using state-of-the-art pre-trained models and LLMs, implemented on Databricks' unified analytics platform. The system will process various types of healthcare data, extract relevant clinical entities, classify medical conditions, and generate structured annotations that support downstream analytics and machine learning applications.
The selected intern's responsibilities will include, but are not limited to, the following:
- Data Analysis: Examine healthcare datasets to understand characteristics, quality issues, and annotation requirements across various formats (clinical notes, lab reports, radiology findings)
- Model Research: Evaluate pre-trained models (BioBERT, ClinicalBERT) and LLMs for healthcare annotation tasks based on accuracy, speed, and compatibility
- Pipeline Development: Implement an ETL pipeline on the Databricks platform that integrates models for named entity recognition, medical coding, sentiment analysis, and outcome prediction
- Performance Optimisation: Design scalable workflows with distributed computing, caching strategies, and parallel processing for large-volume data handling
- Quality Assurance: Develop evaluation frameworks comparing automated results against gold standards and implement continuous monitoring systems
- Documentation: Create comprehensive documentation and knowledge transfer materials for system maintenance and extension
About you:
- Pursuing a Bachelor's Degree in Business Analytics, Computer Engineering, Computer Science, Data Science or a related discipline
- Graduating in Dec 2026 or May 2027
- Programming Skills: Proficient in Python, experience with data science libraries (Pandas, NumPy, Matplotlib), familiarity with ML frameworks
- SQL proficiency, understanding of relational and NoSQL systems
- Cloud computing platform experience (preferred), reliable computing resources and internet connectivity
- Commitment to confidentiality and data protection protocols, completion of required compliance training
- Good team player with strong analytical and communication skills
- Ability to multitask and work effectively as part of a multidisciplinary team
- Passionate about making a difference and keen to re-imagine the future of HealthTech