The team of Professor Nie Zaiqing at AIR, in collaboration with the Tsinghua-affiliated startup ShuiMu Molecule, has developed LangCell, a large model for single-cell identity understanding. The model learns a unified representation of single-cell data and natural language, making it the first model capable of annotating new cell types without requiring labeled examples.
In addition, LangCell significantly improves performance on a range of tasks related to cell identity understanding, including batch correction, classification of disease subtypes, and identification of cellular pathways. Even without textual information, the model's cell encoder module on its own achieves the best performance across these tasks.
Moreover, the team has constructed scLibrary, a dataset that pairs cells with natural-language text. It comprises approximately 27.5 million entries covering eight dimensions of descriptive information, including cell types, developmental stages, tissues and organs, and diseases, making it a veritable "encyclopedia of cells." The related paper has been accepted for presentation at ICML 2024, and the associated work is now open-sourced on GitHub, allowing researchers and medical professionals worldwide to use LangCell for research and exploration.
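To make the idea of label-free annotation concrete, here is a minimal sketch of how a shared cell-text embedding space can, in principle, be used to assign a cell type: embed the cell, embed a text description of each candidate type, and pick the most similar description. This is an illustration under stated assumptions, not LangCell's actual code or API. The function names, the temperature value, and the three-dimensional toy vectors are all hypothetical; a real model would produce the embeddings with its cell and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores, temperature=1.0):
    """Turn similarity scores into probabilities (numerically stable)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_annotate(cell_embedding, label_embeddings, temperature=0.1):
    """Pick the label whose text embedding is closest to the cell embedding.

    label_embeddings maps a cell-type name to the embedding of its text
    description. No labeled cells of any type are needed, only descriptions.
    """
    labels = list(label_embeddings)
    sims = [cosine(cell_embedding, label_embeddings[name]) for name in labels]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy embeddings (hypothetical values, chosen only for illustration).
label_embeddings = {
    "T cell": [0.9, 0.1, 0.0],
    "B cell": [0.1, 0.9, 0.0],
    "hepatocyte": [0.0, 0.1, 0.9],
}
cell_embedding = [0.8, 0.2, 0.1]

predicted, probabilities = zero_shot_annotate(cell_embedding, label_embeddings)
print(predicted)
```

Because a new cell type only needs a text description to become a candidate label, the same procedure extends to types that were never annotated in the training data, which is the property the announcement highlights.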
Paper Link: https://arxiv.org/abs/2405.06708
Read More: https://air.tsinghua.edu.cn/info/1007/2247.htm