About xieydd

📫 To contact me, send an email to [email protected], or add me on WeChat; my ID is the output of echo -n 'eGlleWRkX2hhaGEK' | base64 -d.

💻 As of 2024, I have over 6 years of experience in AI Infrastructure:

2018-2021.2 (including internship) Unisound

  1. At the AI algorithm company Unisound, I was responsible for the development and operation of the Atlas supercomputing platform, supporting NLP and CV model training. Key responsibilities included:
  • Developing a large-scale intelligent scheduling system to optimize multi-tenant resource allocation
  • Enhancing the performance of the high-performance distributed file system Lustre
  • Building a multi-layer cache cloud-native architecture to accelerate AI model training
  2. Worked on 8-bit training and inference optimization, tuning models for NPUs and NVIDIA edge devices.

2021.2-2023.5 Tencent Cloud

  1. Developed a large-scale AI platform for public cloud:
  • Built a high-performance, scalable, elastic offline training platform using EKS (Elastic Kubernetes Service)
  • Integrated public cloud object storage and the GooseFS accelerator to create a high-performance cache scheduling system on the cloud
  2. Established FinOps infrastructure to help public cloud customers manage and optimize cloud costs more effectively and raise cloud resource utilization:
  • Optimized scheduling and rescheduling, distinguished high-priority from low-priority tasks, and implemented intelligent elastic scaling
  • Combined Tencent's Ruyi kernel scheduler optimizations with observability to reduce costs while maintaining service quality
  • Launched a large-scale cost reduction initiative in the internal cloud, improving resource utilization through efficient resource allocation

2023.5-present Tensorchord

  1. Leading the development of ModelZ, a serverless inference platform on GCP that provides model serving with optimized cold starts:
  • Reduced model service cold start time through model caching and image preheating
  • Implemented JuiceFS to build a high-performance cache scheduling system, enhancing model service performance
  2. Leading the Cloud Team, developing and providing customer support for VectorChord Cloud, the cloud service for the vector database VectorChord:
  • Built a Postgres-based vector database on AWS with control plane and data plane separation, plus BYOC (Bring Your Own Cloud) and BYOD (Bring Your Own Data) capabilities
  • Implemented a cloud-native architecture that provides Postgres storage and compute separation, high availability, backup, PITR (Point-In-Time Recovery), and in-place upgrades

Skill set: Kubernetes, GCP, AWS, Kubeflow, FinOps, RAG, Vector Database, Storage Acceleration, TensorFlow, PyTorch, Cloud Native, MLOps, AI Infrastructure, etc.

Open Source Projects

🌱 Currently focusing on MLOps and FinOps, contributing to several open source projects:

  1. fluid: Fluid, elastic data abstraction and acceleration for BigData/AI applications in the cloud. (Project under CNCF)
  2. crane: Crane is a FinOps Platform for Cloud Resource Analytics and Economics in Kubernetes clusters. The goal is to help users manage cloud costs more easily while ensuring application quality.
  3. crane-scheduler: Crane scheduler is a Kubernetes scheduler that can schedule pods based on actual node load.
  4. creator: Creator is the brain of the crane project, containing the core algorithm module and evaluation module.
  5. openmodelz: One-click machine learning deployment (LLM, text-to-image, etc.) at scale on any cluster (GCP, AWS, Lambda labs, your home lab, or even a single machine).
  6. clusternet: [CNCF Sandbox Project] Managing your Kubernetes clusters (including public, private, edge, etc.) as easily as browsing the Internet.
  7. vectorchord: Scalable, fast, and disk-friendly vector search in Postgres, the successor of pgvecto.rs.

| Type | Author/Company | Blog URL |
| --- | --- | --- |
| Infra | Chris Riccomini | https://materializedview.io/ |
| Infra | Jack Vanlightly | https://jack-vanlightly.com/ |
| Math and Science | 苏剑林 (Su Jianlin) | https://kexue.fm/ |
| AI Infra | Colfax | https://research.colfax-intl.com/blog/ |
| Postgres | Gabriele Bartolini | https://www.gabrielebartolini.it/articles/ |
| AI | Sebastian Raschka | https://magazine.sebastianraschka.com/archive?sort=new |
| AI Algorithm | Tom Yeh | https://www.byhand.ai/ |
| AI Infra | Chip Huyen | https://huyenchip.com/blog/ |