📫 To contact me, send an email to [email protected], or add me on WeChat: `echo -n 'eGlleWRkX2hhaGEK' | base64 -d`.
💻 As of 2024, I have over 6 years of experience in AI Infrastructure:
2018-2021.2 (including internship) Unisound
- At the AI algorithm company Unisound, I was responsible for developing and operating the Atlas supercomputing platform, which supported NLP and CV model training. Key responsibilities included:
  - Developing a large-scale intelligent scheduling system to optimize multi-tenant resource allocation
  - Improving the performance of the Lustre distributed file system
  - Building a cloud-native, multi-layer caching architecture to accelerate AI model training
- Worked on 8-bit training and inference optimization, tuning models for NPUs and NVIDIA edge devices
2021.2-2023.5 Tencent Cloud
- Developed a large-scale AI platform for the public cloud:
  - Built a high-performance, scalable, elastic offline training platform on EKS (Elastic Kubernetes Service)
  - Integrated public cloud object storage with the GooseFS accelerator to create a high-performance cache scheduling system in the cloud
- Established FinOps infrastructure to help public cloud customers manage and optimize cloud costs and improve resource utilization:
  - Optimized scheduling and rescheduling, distinguished high-priority from low-priority tasks, and implemented intelligent elastic scaling
  - Combined Tencent's Ruyi kernel scheduler optimizations with observability to reduce costs while maintaining service quality
- Launched a large-scale cost reduction initiative in the internal cloud, improving resource utilization through more efficient resource allocation
2023.5-present Tensorchord
- Leading the development of ModelZ, a serverless inference platform on GCP that serves models with optimized cold starts:
  - Reduced model service cold start time by caching model services and preheating images
  - Used JuiceFS to build a high-performance cache scheduling system, improving model service performance
- Leading the Cloud Team, developing and supporting customers of VectorChord Cloud, the cloud service for the vector database VectorChord:
  - Built a Postgres-based vector database on AWS with control plane and data plane separation and BYOC (Bring Your Own Cloud) and BYOD (Bring Your Own Data) capabilities
  - Implemented a cloud-native architecture providing Postgres storage and compute separation, high availability, backup, PITR (Point-In-Time Recovery), and in-place upgrades
Skill set: Kubernetes, GCP, AWS, Kubeflow, FinOps, RAG, Vector Database, Storage Acceleration, TensorFlow, PyTorch, Cloud Native, MLOps, AI Infrastructure, etc.
Open Source Projects
🌱 Currently focusing on MLOps and FinOps, contributing to several open source projects:
- fluid: Fluid, elastic data abstraction and acceleration for BigData/AI applications in the cloud. (Project under CNCF)
- crane: Crane is a FinOps Platform for Cloud Resource Analytics and Economics in Kubernetes clusters. The goal is to help users manage cloud costs more easily while ensuring application quality.
- crane-scheduler: Crane scheduler is a Kubernetes scheduler that can schedule pods based on actual node load.
- creator: Creator is the brain of the crane project, containing the core algorithm module and evaluation module.
- openmodelz: One-click machine learning deployment (LLM, text-to-image, etc.) at scale on any cluster (GCP, AWS, Lambda Labs, your home lab, or even a single machine).
- clusternet: [CNCF Sandbox Project] Managing your Kubernetes clusters (including public, private, edge, etc.) as easily as browsing the Internet.
- vectorchord: Scalable, fast, and disk-friendly vector search in Postgres, the successor of pgvecto.rs.
Recommended Blogs
Type | Author/Company | Blog URL |
---|---|---|
Infra | Chris Riccomini | https://materializedview.io/ |
Infra | Jack Vanlightly | https://jack-vanlightly.com/ |
Math and Science | 苏剑林 | https://kexue.fm/ |
AI Infra | Colfax | https://research.colfax-intl.com/blog/ |
Postgres | Gabriele Bartolini | https://www.gabrielebartolini.it/articles/ |
AI | Sebastian Raschka | https://magazine.sebastianraschka.com/archive?sort=new |
AI Algorithm | Tom Yeh | https://www.byhand.ai/ |
AI Infra | Chip Huyen | https://huyenchip.com/blog/ |