The 17th China R Conference & 2024 X-Intelligence Conference & 2024 International Forum on Data Science (Joint Conference)

董彬

AI for Mathematics

This talk focuses on recent progress in using artificial intelligence to assist mathematical exploration. We first review the background and current state of AI-empowered mathematical research, including applications of machine learning in inspiring mathematicians' frontier explorations. We then present preliminary results from some ongoing work. Finally, we look ahead to the opportunities and challenges at the intersection of AI and mathematics.

董彬 is a Professor at the Beijing International Center for Mathematical Research, Peking University, and Deputy Director of the International Center for Machine Learning Research. His research areas are machine learning, scientific computing, and computational imaging. He received the Qiushi Outstanding Young Scholar Award in 2014, was invited to give a 45-minute talk at the International Congress of Mathematicians (ICM) in 2022, was selected for the New Cornerstone Investigator Program in 2023, and received the Wang Xuan Outstanding Young Scholar Award in the same year.

刘红升

The Development and Vision of AI4Science in the Era of Large Models

This talk reviews the latest industry progress of AI for Science across domains and presents Huawei AI4Sci Lab's latest research on empowering these directions with large models, built on the Ascend AI hardware/software stack and the MindSpore AI framework, along with an outlook on future work.

刘红升 holds a bachelor's degree from the School of the Gifted Young at the University of Science and Technology of China and a Ph.D. in Statistics from the University of North Carolina at Chapel Hill. He is currently a MindSpore architect and head of the AI4Sci Lab at Huawei's 2012 Laboratories. On top of the Ascend AI stack and the MindSpore AI framework, he built MindScience, an open-source framework for AI4Sci covering biology, chemistry, fluid dynamics, meteorology, electromagnetics, and other domains.

Songxi Chen

Digital Twin of Economic Systems

A digital twin of a system is a high-precision numerical simulation based on the integration of system models and observational data, representing the pinnacle of understanding of that system. I will discuss the importance and feasibility of establishing a digital twin for the Chinese economic system, as well as the requirements for high spatiotemporal resolution economic datasets and the development of large-scale econometric models.

Dr. Songxi Chen is an Academician of the Chinese Academy of Sciences. He is currently serving as the President of the Chinese Society for Probability and Statistics for the term 2023-2026. Dr. Chen earned his Ph.D. in Statistics from the Australian National University in 1993. Prior to his full-time return to China, he held faculty positions at the National University of Singapore and Iowa State University. From 2010 to 2019, Dr. Chen served as the Founding Director of the Center for Statistical Science at Peking University. His research interests are diverse and include high-dimensional data inference, environmental modeling and assessment, empirical likelihood, statistical and machine learning, and stochastic process inference. Notably, his recent work on air quality assessment and epidemiology has had a significant impact on environmental and public health in China. Dr. Chen is a Fellow of the Institute of Mathematical Statistics (IMS), the American Statistical Association, and the American Association for the Advancement of Science. He is also an elected member of the International Statistical Institute (ISI).

Chuanhai Liu

First Principles of Advanced Data Analysis: the Prediction Principle

This era of big data is fascinating for data analysis in particular and statistics in general. It has also clearly revealed more than ever different scientific attitudes toward data analysis and statistical research from different perspectives. As statisticians, we see both challenges and responsibility for foundational developments in both statistical inference and scientific modeling. This talk introduces a new principle, called the prediction principle. We argue that this principle can serve as a first principle for valid and efficient inference by exploring its implications in three key research directions: (a) how the prediction principle can be used to refine both the principle of maximum likelihood and the likelihood principle, (b) how statistical inference should be formalized, as the required reasoning is deductive, and (c) how a general theory of scientific modeling might be achievable, despite the inherent challenges of inductive reasoning. These discussions are illustrated using seemingly simple but unsolved problems in high-dimensional statistics and deep learning models. To prompt deeper reflections, the talk concludes with a few challenging problems.

Chuanhai Liu earned his correspondence diploma from Central China Normal University in 1985, master's degree in Probability and Statistics from Wuhan University in 1987, and PhD in Statistics from Harvard University in 1994. He worked at Bell Laboratories for ten years starting in 1995 and at Texas A&M as an Associate Professor in Spring 2004. Since 2005, he has been a Professor of Statistics at Purdue University. His research interests include the foundations of statistical inference, statistical computing, and applied statistics. Much of his work on iterative algorithms, such as Quasi-Newton, EM, and MCMC methods, is discussed in his book titled "Advanced Markov Chain Monte Carlo Methods" (2010), co-authored with F. Liang and R. J. Carroll. His work on the foundations of statistical inference, developing a new inferential framework for prior-free probabilistic inference, is included in his book titled "Inferential Models: Reasoning with Uncertainty" (2015), co-authored with R. Martin. For his research on statistical computing, he spent several years experimenting with a multi-threaded and distributed R software system called SupR for big data analysis. Currently, he is working on topics for a potential new book titled "Scientific Modeling: Principles, Methods, and Examples."

阎栋

From Imitation to Emergence: The Journey of Alignment for LLMs

Alignment techniques for large language models have advanced rapidly over the past two years. Beyond the Supervised Fine-Tuning and Reinforcement Learning from Human Feedback used by InstructGPT, methods such as Rejection Sampling, Direct Preference Optimization, and Identity-Preference Optimization have emerged, providing a rich toolbox for industrial deployment under various objectives and constraints. Using these alignment tools well requires not only understanding the mathematical principles underlying each method but also solid engineering support. This talk starts from the theoretical landscape of alignment, moves into a discussion of engineering practice, and closes with an outlook on the future of alignment. Through this panoramic review, the audience should come away understanding the challenges of alignment and how to apply it in business scenarios. 1. Theoretical Landscape of Alignment; 2. Practical Data-Centric Process; 3. Scalable Oversight and Beyond.
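For orientation on one of the methods named above, the objective of Direct Preference Optimization (Rafailov et al., 2023) can be stated compactly; the following is its standard published form, not a description of the speaker's systems:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big],
\]

where y_w and y_l are the preferred and rejected responses, \pi_{\mathrm{ref}} is the reference (SFT) policy, and \beta controls the strength of the implicit KL constraint; the appeal is that preference alignment reduces to a supervised loss without an explicit reward model or RL loop.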

Head of reinforcement learning at Baichuan Intelligence (百川智能). He received his Ph.D. from the Department of Computer Science and Technology at Tsinghua University and works on decision-making algorithms/systems and LLM alignment. On the algorithm side, he proposed a framework connecting model-free and model-based reinforcement learning through reward-allocation mechanisms. On the systems side, he was the architect of the reinforcement learning programming framework Tianshou (天授), which has earned over 7.4k stars and 1.1k forks on GitHub. He has published more than ten papers in venues such as ICLR, ICML, IJCAI, AAAI, JMLR, and Pattern Recognition. His team's RLHF-enhanced large language model Baichuan3 ranked first in China in the April SuperCLUE evaluation.

张舸

MAP-Neo: A High-Performance, Fully Transparent, Open-Source Bilingual Large Language Model

We open-source MAP-Neo, the first industrial-grade fully transparent Chinese-English bilingual large model. We release all 4.7T tokens of pre-training data, the training pipeline, a Spark-based pre-training data pipeline, an OCR pipeline, and a ready-to-use pipeline reproducing DeepSeek-Math's iterative recall of high-quality data from the pre-training corpus. At the 7B scale, compared with OLMo and Amber, Neo's performance as a base model is broadly comparable to industrial-grade SOTA.

Ph.D. student at the University of Waterloo, Canada; founder of the M-A-P community and initiator of the COIG series of works.

吉嘉铭

Alignment Mechanisms of Large Models and Efficient Alignment Fine-Tuning Techniques

This talk covers the mechanisms of large-model alignment and efficient alignment fine-tuning techniques. On the mechanism side, we ask whether models resist alignment and refuse to be changed: can the intentions and values shaped during pre-training be modified at the alignment stage? Compared with the data volume and number of parameter updates in pre-training, alignment requires markedly less of both, and even carefully aligned models can easily be circumvented, deliberately or inadvertently. The talk examines whether elasticity exists in large-model parameters, whether alignment truly changes a model's intrinsic properties or is merely superficial, and how to achieve efficient large-model alignment without RLHF.

Ph.D. student at the Institute for Artificial Intelligence, Peking University, advised by Assistant Professor 杨耀东. His research focuses on the safety and value alignment of large models. He is supported by one of the first National Natural Science Foundation of China Young Student Basic Research Projects (for Ph.D. students) and is a recipient of the Peking University President's Scholarship. Homepage: jijiaming.com.

余天予

MiniCPM-V

Many obstacles stand in the way of practical multimodal large models. The MiniCPM-V series achieves near GPT-4V-level performance on device through techniques including MiniCPM's efficient scaling law, VisCPM's cross-lingual generalization, trustworthy-behavior learning with RLHF-V and RLAIF-V, and high-resolution image encoding with LLaVA-UHD.

Ph.D. student at the Natural Language Processing Lab, Tsinghua University, working mainly on multimodal large models.

张松阳

Large-Model Evaluation in Practice with OpenCompass

Evaluation is the compass of large-model development; how to evaluate model capabilities comprehensively, scientifically, and objectively is a central concern across academia and industry. OpenCompass builds out large-model evaluation along several dimensions: a capability taxonomy, a toolchain, evaluation data, and model leaderboards. This talk presents OpenCompass's concrete evaluation practices and related reflections.

Young researcher at the Shanghai AI Laboratory and technical lead of OpenCompass.

莫欣

Letting Developers' Ideas Flow: The Essential Experience a Good AI Application Development Framework Should Provide

Starting from open-source projects built with the Agently framework, this talk walks developers layer by layer through how the Agently AI application development framework helps them ship projects: from scaffolding that amplifies the capability of a single model request, to code-level workflow orchestration, and how these features help developers smoothly and efficiently turn ideas into highly available, high-quality business code.

Founder of 北京智体纪元科技有限公司, head of the Agently AI application development framework, and former developer-ecosystem product manager at 光年之外.

黄志国

langchain-chatchat: An Open-Source, Offline-Deployable Retrieval-Augmented Generation (RAG) LLM Knowledge-Base Project

A question-answering application over local knowledge bases built on the ideas of langchain, aiming to provide a knowledge-base QA solution that is friendly to Chinese scenarios and open-source models and can run fully offline. Relying on the open-source LLMs and Embedding models the project supports, it can be deployed privately and entirely offline using open-source models only; it also supports calling the OpenAI GPT API and will keep expanding access to other models and model APIs.
The implementation pipeline is: load files -> read text -> split text -> vectorize text chunks -> vectorize the question -> match the top-k chunks most similar to the question vector -> add the matched text, together with the question, into the prompt as context -> submit to the LLM to generate an answer.
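To make the pipeline concrete, here is a minimal Python sketch; embed and llm are hypothetical stand-ins for whichever Embedding model and LLM the project is configured with, and the fixed-size splitter is a simplification of the project's text splitters:

    import numpy as np

    def build_index(doc_text, embed, chunk_size=500):
        # split text into fixed-size chunks (a simplification) and vectorize them
        chunks = [doc_text[i:i + chunk_size] for i in range(0, len(doc_text), chunk_size)]
        return chunks, np.asarray(embed(chunks))

    def answer(question, chunks, vectors, embed, llm, k=3):
        q = np.asarray(embed([question]))[0]          # vectorize the question
        sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
        top = np.argsort(sims)[::-1][:k]              # top-k most similar chunks
        context = "\n".join(chunks[i] for i in top)   # matched text as context
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return llm(prompt)                            # submit to the LLM for the answer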

Ph.D. in Actuarial Science from Nankai University, postdoctoral fellow at the Institute of Science and Technology Research, Zhejiang University, and member of the langchain-chatchat core development team.

唐飞虎

Accelerating Inference for Long-Context Applications

Inference performance is one of the bottlenecks of today's long-context large models. This talk surveys inference-acceleration techniques applicable to long-context models, focusing on the methods most used in practice. In a typical AI workflow, you may pass the same input tokens to a model over and over. With context caching, you can pass content to the model once, cache the input tokens, and reference the cached tokens in subsequent requests. At certain volumes, using cached tokens is cheaper (and lower-latency) than repeatedly passing in the same corpus.
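A minimal sketch of the idea (a toy illustration, not any vendor's actual API; encode_prefix and generate_with_state are hypothetical calls):

    import hashlib

    class ContextCache:
        # Toy prefix cache: pay the cost of encoding a long, repeated prefix once,
        # then reuse the cached state for every later request.
        def __init__(self, model):
            self.model = model
            self.store = {}

        def generate(self, prefix, suffix):
            key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
            if key not in self.store:
                # the long shared prefix is encoded exactly once
                self.store[key] = self.model.encode_prefix(prefix)     # hypothetical call
            state = self.store[key]
            # subsequent requests pay only for the new suffix tokens
            return self.model.generate_with_state(suffix, state)      # hypothetical call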

Senior R&D engineer and head of developer relations at Moonshot AI (月之暗面).

冯伟健

How to Build Synthetic Data for Model SFT

Model SFT is critically important work, but its results are usually constrained by the quantity and quality of available data. This talk focuses on the different categories of SFT data, how to synthesize usable SFT data, and how to mix synthetic data with real data.

Head of the AI department at 为明教育集团, CTO of 明日知己教育科技, currently on leave from his first year of undergraduate study at the Chinese University of Hong Kong.

孙一乔

AIxEdu: The Best Track for Super Apps, and the Mathematical-Reasoning Proving Ground on the Road to AGI


The era of AGI super apps is approaching, and China's super apps will lead the world. Education, as a hard-demand industry, is one of the tracks most likely to produce a super app first.
Mathematical reasoning is key to reaching artificial general intelligence (AGI), and education is precisely the best proving ground for LLMs to improve their mathematical reasoning.
Most importantly, AI's transformation of education aligns with and supports national policy and development goals, giving every child access to an affordable, personalized AI teacher.

- Founder of 悉之智能, dedicated to transforming education with AI; 7 years of experience in AI problem-solving and explanation; an AI-teacher large model that has repeatedly set industry SOTA and is widely deployed at home and abroad; raised US$30 million from first-tier VCs including Qiming, Matrix Partners, ZhenFund, and New Oriental.
- Its North American product has over a million users, ARR above US$1 million, tens of millions of accumulated problems, an App Store rating of 4.8+, and the industry's highest solve rate together with multimodal interactive AI explanations. In China it powers AI upgrades for education companies, co-developing the U-shannon large model with New Oriental's 优编程 and launching an affordable AI tutoring platform with Unisplendour (紫光), working toward truly personalized, inclusive education through AI.

胡修涵

Building For Fun Agents

Entertainment applications in the AGI era will use For Fun Agents as their basic service unit, and building those agents may require broad public participation.
How to systematically build character content frameworks and events, so as to create the agents most worth interacting with and most capable of efficiently producing high-quality content, is the core value of the 捏Ta platform.

B.S. in Intelligence Science from Peking University and M.S. from Columbia University, where he was a graduate researcher in the robotics lab. Formerly a tech lead on Facebook's video products, an R&D team lead at Alibaba, and VP of Engineering at Tezign (特赞), where he shipped Tezign's digital asset management (DAM) system and led the team past RMB 100 million in product revenue. In 2022 he founded 看见概念(上海)智能科技有限公司 to build 捏Ta, an AI-driven platform for fantasy creation.

Connor Wang

Agent in Action

The implementation and application of Action Agents: how to integrate AI applications with the products and economic ecosystem of the mobile internet, letting the "thinking" capability of LLMs empower the "doing" capability of classic engineering, so that AI handles the boring but necessary chores of your day.

Founder & CEO @ Six AI.

祝海林

The AI Database Byzer and the Programming Tool Chat-Auto-Coder in the Era of Large Models

Large models are a milestone in AI and are reshaping every corner of society. The Byzer AI database uses SQL as its interaction language and can, innovatively, register mainstream large models as UDFs; it also supports pre-training, fine-tuning, and deploying large models, helping enterprises quickly adopt them in scenarios such as ETL, data analysis, streaming computation (risk control), and applications. Chat-Auto-Coder helps users read and iterate on existing projects (including SQL projects); users can even modify and test code without opening an editor.

Byzer community PMC member, senior data architect, and technical partner at Kyligence, with 15+ years of R&D experience focused on the convergence of Data + AI and on helping enterprises put Data+AI into production. He is passionate about designing and building open-source products; Byzer/MLSQL are his main open-source works, and his latest product, auto-coder, is a programming tool that goes beyond code completion and aims at multiplying enterprise R&D efficiency. The Byzer AI database won second prize in the 2022 China Open Source Innovation Competition and first prize in the 2023 Pudong New Area AI Innovation Competition; he was named one of China's 33 Open Source Pioneers in 2022 and received the "Developer Pioneer" award at the 2023 Global AI Developer Pioneer Conference.

吴智楷

The Modelscope-Agent Open-Source Framework: Feature-Complete and Production-Ready

In the era of large models, agents have drawn broad attention as a likely path to real-world deployment. To meet complex, fast-changing user needs and production scenarios, the Modelscope-Agent framework integrates closely with the Modelscope open-source ecosystem and provides complete functionality including customizable agents, open-source LLM support, API service integration, and distributed multi-agent tasks. With just a few lines of code, users can build their own agent and put it to work in their scenario.

Developer of the Modelscope-Agent framework at Alibaba's ModelScope (魔搭) community.

张先轶

The PerfXLM Inference Engine and PerfXCloud Inference Cloud System, Optimized for Domestic Hardware

In today's intelligent computing centers, with their mixed NVIDIA/non-NVIDIA architectures, fully exploiting domestic accelerators so that they are usable, easy to use, and efficiently utilized is an urgent problem. We present the PerfXLM large-model inference engine and the PerfXCloud inference cloud system, adapted and optimized for a variety of domestic GPUs and NPUs; migration and adaptation of mainstream models, including language models and embedding models, has been completed.

张先轶 received his bachelor's and master's degrees from Beijing Institute of Technology and his Ph.D. from the University of Chinese Academy of Sciences; he worked at the Institute of Software, Chinese Academy of Sciences, followed by postdoctoral research at UT Austin and MIT. He is the founder and principal maintainer of OpenBLAS, the internationally known open-source matrix computation project, a member of the CCF Technical Committee on High Performance Computing, and an executive committee member of ACM SIGHPC China. In 2016 he founded PerfXLab (澎峰科技), which provides heterogeneous computing software stacks and solutions. His honors include the CCF Science and Technology Award (second class, 2016), the CAS Outstanding Science and Technology Achievement Award (2017), the SIAM Activity Group on Supercomputing Best Paper Prize (2020), the Beijing Natural Science Award (second class, 2023), and a 2023 BenchCouncil world open-source contribution award.

袁进辉

The Road to a 10,000x Reduction in Large-Model Deployment Cost

It is now consensus that the large-model technology behind ChatGPT marks a new round of technological change; AI-native applications of every kind are on the way, and AI is expected to be everywhere in our work and lives before long. A major factor currently limiting AI applications is the cost of deploying large models. This talk explores how to resolve the tension between rapidly exploding AI applications on one side and scarce compute and expensive inference on the other, and whether there is a chance to cut large-model inference costs by a factor of 10,000 and accelerate the arrival of the AGI era.

袁进辉 received his B.S. in Computer Science from Xidian University in 2003 and his Ph.D. from the Department of Computer Science and Technology, Tsinghua University, in 2008, winning Tsinghua's outstanding doctoral dissertation award. From 2008 to 2011 he conducted postdoctoral research at Tsinghua in computational neuroscience. From 2013 to 2016 he was a Lead Researcher at Microsoft Research Asia, where he led the development of LightLDA, a large-scale machine learning system used in Microsoft products. From 2016 to 2023 he initiated and led the open-source deep learning framework OneFlow, designing a series of new methods for programmability and efficiency in distributed deep learning systems that have been widely adopted by industry. His current research area is AI infrastructure: co-designing algorithms, systems, and hardware to build inference-acceleration engines that lower the cost and barrier to entry of large-model applications.

黄锦涛

The SWIFT Toolbox: Simplifying the Journey of Large-Model Application

Large language models and multimodal large models are becoming a key driving force of technical innovation and applications. Yet effectively integrating these diverse models, especially in the multimodal domain, behind a simple, unified interface is a thorny challenge for many practitioners. SWIFT is a toolbox designed to simplify working with large models. It supports fine-tuning, human-feedback alignment, inference, evaluation, quantization, and deployment for 250+ large language models and 35+ multimodal models, including the Qwen, Llama, GLM, Internlm, Yi, Baichuan, DeepSeek, and Llava families. In addition, we provide a rich adapter library gathering recent training techniques such as LoRA+, GaLore, and Llama-Pro, complementing PEFT-style lightweight training. SWIFT also offers a Gradio-based Web UI and many best-practice guides to help researchers and developers get started with fine-tuning and applying large models.

Developer of the SWIFT framework at the ModelScope (魔搭) community.

Kim

Advantages and Applications of GPUs in Quantitative Investment

Since NVIDIA released the CUDA programming model in 2007, 17 years of development have brought GPU compute and memory well beyond the capabilities of general-purpose CPUs.
Quantitative investment has always been at the technological frontier, and high-performance computing programs originally written for CPUs are gradually being migrated to GPU acceleration.
This talk covers everyday GPU application scenarios, problems encountered in real development, and concrete cases where GPUs improved business efficiency.

Kim works at a leading quantitative hedge fund, where he is responsible for developing low-latency, high-performance computing systems for quantitative trading.

许以言

Model Lifecycle Management for Organized Scientific Research

With the rapid growth of organized scientific research, data and its value increasingly manifest in higher-dimensional form as models, and the analysis process now requires experts from multiple domains. Centered on the model-lifecycle workflow in spatial data intelligence scenarios, this talk introduces ModelOps methods and discusses how platform-based tools and community-based practices can support organized research in cross-disciplinary settings.

许以言 is a product expert at 和鲸科技 (Heywhale), focusing on product design and methodological innovation for data science platforms in data-driven research and AI for Science. 许以言 has participated in landing the ModelWhale collaborative data science platform in meteorology, geology, remote sensing, space science, clinical research, and many other research-intelligence domains, and brings distinctive insight into, and rich experience with, multi-role collaborative research workflows in data-intelligence scenarios.

刘思喆

Practical Applications of Causal Inference in Industry

This talk centers on the core value of causal inference in industry, discussing its importance in product optimization, marketing strategy, supply chain management, and other business areas. It also attempts a systematic review of common causal inference techniques, including randomized experiments, propensity score matching, regression discontinuity, and synthetic control, and examines how they relate to one another, where they apply, and their potential limitations. Through real enterprise cases, we show end to end how these methods yield precise causal insights that continually power high-quality decisions.

刘思喆 is a council member of the Capital of Statistics (统计之都). He has worked on algorithms, data science, and marketing enablement across the lottery, telecom, e-commerce, education, transportation, and restaurant industries. He was assistant vice president and chief data scientist of 51Talk's data intelligence center, and previously a senior manager of JD's recommendation platform and a member of JD's technology hall of fame. He is an external master's advisor for Renmin University's big data analytics experimental class and for the School of Information at Capital University of Economics and Business. An evangelist for R in China with 21 years of experience, he authored 《153分钟学会R》 and translated 《R语言核心技术手册》.

张丹

Best Practices for Landing Data Analysis

We live in the era of big data: data is produced everywhere, most data is no longer scarce, and the analysis methods and algorithmic models are already written in the textbooks.
How to mine the value of data, put analysis into production, and turn data value into one's own value is the core question for a data analyst.

Data analysis must solve real business problems; false needs and unclear goals doom a project. Data analysis is not just a metric system, still less a pile of metrics: the market changes, the data changes, and our knowledge must change with them.
Data analysis is interdisciplinary work that demands ever more of its practitioners; the era of merely calling packages is over. Only by taking a fresh look at the data, the business, the technology, and ourselves, and adapting to change, can we deliver projects well and land them.

张丹 is an R practitioner, CTO of 北京青萌数海科技有限公司, and a Microsoft MVP.
He has 10+ years of experience in internet application architecture, with deep expertise in R, big data, and data analysis. He is proficient in quantitative trading strategies and familiar with China's secondary financial markets, trading rules, and investment-research systems; he knows data science methodology well and has delivered regulatory-technology projects for customs, drug administration, and foreign exchange.
He is the author of the 《R的极客理想》 series (量化投资篇, 工具篇, 高级开发篇), whose English editions were licensed to the CRC publishing group and released in the United States. Blog: http://fens.me .

朱赛赛

Application Scenarios and Solutions for Large Models on Statistical Data: Exploration and Practice

This talk describes how to build an application- and service-oriented governance system for statistical data and, on top of multi-dimensional, high-quality, massive statistical data and the 华知 large model, how to design data Q&A, data interpretation, specialized data-analysis models, and data-analysis reports that enable deep use of numerical data.

朱赛赛 is product director for books, reference works, and chronicles at Tongfang Knowledge Network (CNKI). He joined CNKI in 2014 and since 2019 has led operations, marketing, and project support for its economic and social big-data products, serving more than a thousand universities, research institutes, enterprises, and public institutions at home and abroad. Through years of project collaboration with the Ministry of Agriculture, the National Bureau of Statistics, and related systems, he has accumulated rich experience in collecting, governing, managing, and applying statistical data.

王小宁

The Intelligent Education Revolution: Improving the Teaching of Statistics and Data Science with Large Language Models

This talk explores how large language models can transform the teaching of statistics and data science. It starts from the challenges and limitations of traditional statistics teaching and the importance of digital education, then focuses on explorations by the data science teaching team at the Communication University of China around LLMs + agents, and introduces 书卷侠 (Scholar Hero, https://scholarhero.cn/), an LLM-based AI teaching assistant, showing how intelligent answering and personalized teaching materials improve teaching outcomes and the learning experience. We also discuss concrete uses of the technology in courses such as Introduction to Data Science, look ahead to trends in education technology, and consider the potential applications of these technologies in teaching practice and their far-reaching impact on the future landscape of data science education.

王小宁 is an Associate Professor at the School of Data Science and Intelligent Media, Communication University of China, head of the LLM agent 书卷侠, and a master's advisor. He is a council member of the China Society for Business Statistics, a researcher at the China Survey and Data Center of Renmin University of China, secretary-general of the (preparatory) AI branch of the China Society for Business Statistics, and secretary-general of the Capital of Statistics. He holds a Ph.D. in Statistics from Renmin University of China; his research covers large language models, sampling design, statistical machine learning, and text mining.

冯晟洋

AI + Digital Employees: Enterprise Best Practices

• Market pain point: enterprise labor costs keep rising while employees struggle with massive data and tedious tasks

• Solution: build digital robots that act as digital employees to cut costs and raise efficiency for enterprises, and as digital assistants to boost personal productivity

• Business: AI application development and services, with NLP as the brain and RPA as the hands, building digital robots for organizations and individuals

• Business model: on-premises + remote deployment

• Competitive advantage: more practical, more cost-effective digital-transformation solutions and applications for a broad customer base

Website: https://www.bluelsqkj.com/robot-development

Co-founder of Blueshirt Technology (上海蓝衫科技有限公司), founder of GPT元宇宙, and founder of 渗透智能 – ShirtAI.

秦旭

A Causal Investigation of Heterogeneity in Mediation Mechanisms in Multisite Randomized Trials

Multisite randomized trials have been pervasive in the past three decades. The importance of investigating the variation in the total impact of an intervention has become increasingly valued. An intervention may generate heterogeneous impacts due to natural variations in participant characteristics, context, and local implementation. Important research questions include whether the intervention impact is generalizable across individuals and contexts, for whom and under what contexts the intervention is effective, and why. To advance this line of research, this study develops a method to assess the mediation mechanism underlying the total impact of the intervention in multisite randomized trials and how it varies by individual and contextual factors. The findings may help practitioners improve and tailor intervention designs and implementations for different individuals and contexts. The method is evaluated through comprehensive Monte Carlo simulations. It is also applied to the National Study of Learning Mindsets for evaluating the mediation mechanism underlying the impact of a growth mindset intervention on math performance and its heterogeneity.

Dr. Xu Qin is an Assistant Professor of Research Methodology at the University of Pittsburgh's School of Education (primary appointment) and an Assistant Professor of Biostatistics at its School of Public Health (secondary). She holds a Ph.D. from the Department of Comparative Human Development at the University of Chicago and a B.S. and an M.S. in Statistics from Renmin University of China.
Her research focuses on solving cutting-edge methodological problems in causal mediation analysis and multilevel modeling. She is also interested in using rigorous and innovative quantitative methods to evaluate the impacts of interventions and the underlying mechanisms. Methodologically, she has developed statistical methods and software for investigating the heterogeneity in causal mediation mechanisms in both multilevel and single-level settings, as well as sensitivity analysis and power analysis methods for causal mediation analysis. Substantively, she is interested in applying advanced statistical methods in developmental, educational, and health research.
Dr. Qin has served as the Principal Investigator or Co-Principal Investigator for grants funded by the Spencer Foundation, the National Science Foundation, and the U.S. Department of Education’s Institute of Education Sciences. She is a recipient of the 2024 NSF CAREER award and the 2022 National Academy of Education/Spencer Postdoctoral Fellowship.

王杰彪

Heterogeneous Causal Mediation Analysis with Bayesian Additive Regression Trees

Causal mediation analysis can help explain the mechanism of how an exposure affects an outcome. The mediation effects are often heterogeneous based on individual characteristics, but most existing methods ignore this heterogeneity and estimate the population average effects. To address this gap, we develop a heterogeneous causal mediation analysis method using Bayesian Regression Tree Ensembles. Distinct from traditional methods, our approach captures complex non-linear interactions and heterogeneous effects in mediation processes more flexibly, offering a refined understanding of the heterogeneity of causal mechanisms. By sampling from the posterior trees of the mediator and outcome models, we are able to obtain rigorous credible intervals for causal mediation effects. We also use partial dependence plots to illustrate which moderators play more important roles and how each effect changes with a moderator. Utilizing simulated datasets, we demonstrate the superiority of our approach in accurate estimation and inference of heterogeneous mediation effects, especially in scenarios characterized by non-linear relationships and interaction effects.

Assistant Professor of Biostatistics and Clinical and Translational Science at the University of Pittsburgh

洪光磊

Organizational Effectiveness: A New Strategy to Leverage Multisite Randomized Trials for Valid Assessment

In education, health, and human services, an intervention program is usually implemented by many local organizations. Determining which organizations are more effective is essential for theoretically characterizing effective practices and for intervening to enhance the capacity of ineffective organizations. In multisite randomized trials, site-specific intention-to-treat (ITT) effects are likely invalid indicators of organizational effectiveness and may lead to inequitable decisions. This is because sites differ in their local ecological conditions, including client composition, alternative programs, and community context. Applying the potential outcomes framework, this study proposes a mathematical definition for the relative effectiveness of an organization. The estimand contrasts the performance of a focal organization with those that share the features of its local ecological conditions. The identification relies on relatively weak assumptions by leveraging observed control-group outcomes that capture the confounding impacts of alternative programs and community context. Simulations demonstrate significant improvements when compared with site-specific ITT analyses or analyses that adjust only for between-site differences in the observed baseline participant composition. We illustrate the use of the strategy through an evaluation of the relative effectiveness of individual Job Corps centers, reanalyzing data from the National Job Corps Study, a multisite randomized trial that included 100 Job Corps centers nationwide serving disadvantaged youths. The new strategy promises to alleviate severe misclassifications in which some of the most effective Job Corps centers are rated least effective and vice versa.

Guanglei Hong is Professor in the Department of Comparative Human Development (https://humdev.uchicago.edu/) at the University of Chicago. She was the Inaugural Chair of the University-wide Committee on Quantitative Methods in Social, Behavioral, and Health Sciences (https://voices.uchicago.edu/qrmeth/) and is a member of the Committee on Education (https://voices.uchicago.edu/coed/). She attained a master's degree in Applied Statistics in 2002 and a Ph.D. in Education in 2004 from the University of Michigan. Before joining the University of Chicago faculty in July 2009, she had been an Assistant Professor in the Human Development and Applied Psychology Department in the Ontario Institute for Studies in Education of the University of Toronto (OISE/UT). Prof. Hong has focused her research on developing causal inference theories and methods for understanding the impacts of large-scale societal changes and the effects of social and educational policies and programs on child and youth development. She has contributed original concepts and developed multiple methods for drawing valid inferences about causal relationships, for investigating heterogeneity in responses to external interventions across individuals and contexts, and for rigorously testing theories about the mechanisms through which such exposures generate impacts. Her book “Causality in a social world: Moderation, mediation, and spill-over” was published by Wiley in 2015. She guest edited the Journal of Research on Educational Effectiveness special issue on the statistical approaches to studying mediator effects in education research in 2012. Additionally, through publishing in first-tier statistics, education, psychology, sociology, and public policy journals and disseminating new methods through workshops and training institutes, her research has generated a broad impact among quantitative methodologists as well as applied researchers. She has received research and training grants from the National Science Foundation, the U.S. Department of Education, the William T. Grant Foundation, the Spencer Foundation, and the Social Sciences and Humanities Research Council of Canada among other funding agencies. She was awarded a prestigious John Simon Guggenheim Memorial Foundation Fellowship in 2021. For more information, please visit her website: https://humdev.uchicago.edu/directory/guanglei-hong.

解海天

Data-driven Policy Learning for a Continuous Treatment

This paper studies policy learning under the condition of unconfoundedness with a continuous treatment variable. Our research begins by employing kernel-based inverse propensity-weighted (IPW) methods to estimate policy welfare. We aim to approximate the optimal policy within a global policy class characterized by infinite Vapnik-Chervonenkis (VC) dimension. This is achieved through the utilization of a sequence of sieve policy classes, each with finite VC dimension. Preliminary analysis reveals that welfare regret comprises three components: global welfare deficiency, variance, and bias. This leads to the necessity of simultaneously selecting the optimal bandwidth for estimation and the optimal policy class for welfare approximation. To tackle this challenge, we introduce a semi-data-driven strategy that employs penalization techniques. This approach yields oracle inequalities that adeptly balance the three components of welfare regret without prior knowledge of the welfare deficiency. By utilizing precise maximal and concentration inequalities, we derive sharper regret bounds than those currently available in the literature. In instances where the propensity score is unknown, we adopt the doubly robust (DR) moment condition tailored to the continuous treatment setting. In alignment with the binary-treatment case, the DR welfare regret closely parallels the IPW welfare regret, given the fast convergence of nuisance estimators.

解海天 graduated in 2023 from the University of California, San Diego. His research focuses on causal inference theory, including nonparametric/semiparametric identification and estimation for instrumental variables, regression discontinuity, and related methods, as well as causal-model-based policy evaluation, policy learning, and statistical decision-making. His work has appeared in international journals including the Journal of Business and Economic Statistics and the Oxford Bulletin of Economics and Statistics.

马慧娟

Quantile Regression Models for Compliers in Randomized Experiments with Noncompliance

Understanding the causal effect of a treatment in randomized experiments with noncompliance is of fundamental interest in many domains. Within the instrumental variable (IV) framework, compliers are the only subpopulation directly relevant to the assessment of the causal treatment effect. In this paper, we study flexible quantile regression models for compliers with and without treatment. We establish unbiased estimating equations by investigating the relationship between the observed data and latent subgroup indicators. A novel iterated algorithm is proposed to solve the discontinuous equations, which involve the unknown parameters in a complicated way. Both the complier average treatment effect and complier quantile treatment effects can be estimated. The consistency and asymptotic normality of the proposed estimators are established. Numerical results, including extensive simulation studies and a real data analysis of the Oregon health insurance experiment, are presented to show the practical utility.

马慧娟 is an Associate Professor at the School of Statistics and the Institute of Statistics and Interdisciplinary Sciences, East China Normal University. She received her Ph.D. in Statistics from the University of Science and Technology of China and was a postdoctoral fellow at Emory University. Her research interests include survival analysis, quantile regression, and causal inference. She has published more than twenty papers in journals including Biometrika, Biometrics, the Journal of Business & Economic Statistics, and Statistica Sinica. She has led a National Natural Science Foundation of China (NSFC) Young Scientists project and a Shanghai Pujiang Talent project, currently leads a subproject of an NSFC key project, and participates in NSFC key projects and Ministry of Science and Technology key R&D projects. She serves as a council member of the survival analysis branch of the Chinese Association for Applied Statistics.

王林勃

The promises of multiple outcomes

A key challenge in causal inference from observational studies is the identification and estimation of causal effects in the presence of unmeasured confounding. In this paper, we introduce a novel approach for causal inference that leverages information in multiple outcomes to deal with unmeasured confounding. The key assumption in our approach is conditional independence among multiple outcomes. In contrast to existing proposals in the literature, the roles of multiple outcomes in our key identification assumption are symmetric, hence the name parallel outcomes. We show nonparametric identifiability with at least three parallel outcomes and provide parametric estimation tools under a set of linear structural equation models. Our proposal is evaluated through a set of synthetic and real data analyses.

Linbo Wang is an associate professor in the Department of Statistical Sciences and the Department of Computer and Mathematical Sciences, University of Toronto. He is also a faculty affiliate at the Vector Institute, a CANSSI Ontario STAGE program mentor, and affiliated with the Department of Statistics, University of Washington, and Department of Computer Science, University of Toronto. Prior to these roles, he was a postdoc at Harvard T.H. Chan School of Public Health. He obtained his Ph.D. from the University of Washington. His research interest is centered around causality and its interaction with statistics and machine learning.

Yue Liu

Quantifying Individual Risk for Binary Outcome: Bounds and Inference

Understanding treatment heterogeneity is crucial for reliable decision-making in treatment evaluation and selection. While the conditional average treatment effect (CATE) is commonly used to capture treatment heterogeneity induced by covariates and to design individualized treatment policies, it remains an averaging metric within subpopulations. This limitation prevents it from unveiling individual-level risks, potentially leading to misleading results. This article addresses this gap by examining individual risk for binary outcomes, specifically focusing on the fraction negatively affected (FNA): a metric assessing the percentage of individuals experiencing worse outcomes with treatment than with control. Under the strong ignorability assumption, FNA is unidentifiable, and we find that the previous Fréchet-Hoeffding bounds are usually wide and unattainable in practice. By introducing a plausible positive-correlation assumption on the potential outcomes, we obtain significantly improved bounds compared to previous studies. We show that even with a positive and statistically significant CATE, the lower bound on FNA can be positive, i.e., even in the best-case scenario many units will be harmed if receiving treatment. Additionally, we establish a nonparametric sensitivity analysis framework for FNA using the Pearson correlation coefficient as the sensitivity parameter, thereby exploring the relationships among the correlation coefficient, FNA, and CATE. We also present a practical and tractable method for selecting the range of correlation coefficients. Furthermore, we propose flexible estimators for the refined FNA bounds and prove their consistency and asymptotic normality. Extensive simulations are conducted to evaluate the effectiveness of the proposed estimators. We apply our method to the right heart catheterization (RHC) data to explore the percentage of patients harmed by RHC.
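For orientation, writing p_1 = P(Y(1)=0) and p_0 = P(Y(0)=1) for the binary potential outcomes, the classical Fréchet-Hoeffding bounds referenced above take the standard form

\[
\max\{0,\; p_1 + p_0 - 1\} \;\le\; \mathrm{FNA} \;=\; P\big(Y(1)=0,\, Y(0)=1\big) \;\le\; \min\{p_1,\; p_0\},
\]

and the talk's contribution is to tighten this interval under a positive-correlation assumption on (Y(1), Y(0)).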

刘越 (Yue Liu) is a lecturer at Renmin University of China; he received his Ph.D. from Peking University in 2019. He has published in machine learning and statistics journals and conferences including the Journal of Machine Learning Research (JMLR), Artificial Intelligence (AIJ), IEEE Transactions on Knowledge and Data Engineering (TKDE), IEEE Transactions on Neural Networks and Learning Systems (TNNLS), ICML, KDD, and UAI.
His research interests include causal inference, Bayesian networks, and causality-based machine learning algorithms.

Wenxuan Zhong

MedReader: a query-based multisource AI learner of medical publications

As the volume and velocity of medical publications increase at an unprecedented pace, a computational learning system is essential to avoid the expensive and time-consuming human annotation that generally hinders the deployment of novel therapeutic methods in clinical practice. To achieve this goal, we develop MedReader, a novel multi-channel learning system that can simultaneously summarize (topic learning), understand (knowledge-graph construction), and generalize (hypothesis generation) knowledge from query-related publications. Like a human learner, MedReader can assess how faithful a discovered concept is by using data beyond the publications and conducting a novel enrichment analysis. We applied MedReader to a COVID-19-related publication set of 4,117 abstracts deposited in the MEDLINE database from 1/1/2020 to 4/30/2020. The hypotheses generated from these 4,117 publications significantly overlapped with hypotheses appearing in subsequent publications: for example, 71% of the predicted gene-gene interactions and 100% of the predicted disease-disease interactions are enriched in subsequent articles. Moreover, the whole learning process takes only 3 minutes, a negligible time frame for clinical practice. Our analysis shows that such a learning system not only helps us summarize publications promptly, at unprecedented speed and scale, but also affords opportunities for discovery.

Dr. Zhong is an Athletic Association Professor in the Department of Statistics at the University of Georgia. She holds a B.S. in Statistics from Nankai University, China, and a Ph.D. in Statistics from Purdue University. After completing her Ph.D., Dr. Zhong pursued a postdoctoral fellowship in Statistics and Computational Biology at Harvard University. She served as an Assistant Professor in the Department of Statistics at the University of Illinois at Urbana-Champaign from 2007 to 2013, before joining the University of Georgia in 2013. Dr. Zhong is an ASA Fellow and an elected Fellow of the International Statistical Institute. She is the co-Director of the big data analytics lab.

Liping Tong

Statistical Research Projects Using Electronic Health Records

With the advent of electronic medical records (EMR), hospitals find themselves overwhelmed with vast quantities of patient data with diverse applications. Given the critical nature of medical data storage and utilization, numerous specialized companies such as Epic, Oracle, and Cerner have emerged. Moreover, hospitals typically employ their own cadre of experts including statisticians, data analysts, and data scientists. Data analysis in hospitals spans a spectrum, ranging from fundamental tasks like data summarization and demonstration using tables and plots to more intricate efforts involving the refinement and creation of statistical methods and models. In this presentation, I will illustrate the necessity of connecting time-dependent survival models with logistic models through a compelling example. Additionally, I will underscore the significance of selecting the most suitable analytical tool to maximize insights from data, drawing from a concrete case study.

Liping Tong is currently a senior statistician in Advocate Aurora Health, leading a team of research and analysis. Liping got her B.A. in 1997 from the Department of Mathematics, Nankai University. She had two years of graduate school in Nankai before going to the Department of Statistics, University of Chicago in 1999. Liping got her PhD in statistics in 2004 and started to work as a research associate in the Department of Statistics, University of Washington. Starting from 2007, she became an Assistant Professor in the Department of Mathematics, Loyola University Chicago. In 2010, she switched to the Department of Public Health Sciences, Loyola University, Stritch School of Medicine. In 2015, she started her career in Advocate Aurora Healthcare, as a senior statistician. The main responsibilities are:
1. Lead the development of prediction models based on millions of patients’ electronic medical records for questions such as readmission risk or chronic disease management. Statistical and computational methods, such as logistic models, hierarchical models, survival analysis, support vector machine, random forest and boosting methods, are used to optimize predictions.
2. Lead the analysis on the evaluation of interventions to reduce adverse events such as emergency department visits and 30-day readmissions after hospitalization. Cox Proportional Hazard models with time dependent covariates are applied in the analysis.
3. Mentor interns, junior statisticians, and data analysts on multiple projects, including evaluation of the program of Palliative Care, application of deep learning and big data strategy in medical science, and so on.
4. Involvement in other team members' projects as a reliable source of expert support.
In addition, Liping has had an active collaboration with professors from the Department of Psychiatry, University of Illinois at Chicago since 2020. The main interest is in the data collected for the Chicago Follow-up Study (CFS), which was designed as a naturalistic, prospective, longitudinal, multi-follow-up research study to investigate the course, outcome, symptomatology, effects of medication, and recovery in participants with serious mental illness disorders. Statistical methods, such as logistic generalized estimating equation (GEE) models, latent class analysis (LCA), network analysis, and clustering methods, have been applied to a wide range of hypotheses of interest.

Zhezhen Jin

On detecting the effect of exposure mixture

To study the effect of an exposure mixture on continuous health outcomes, one can use a linear model with a weighted sum of multiple standardized exposure variables as an index predictor and its coefficient as the overall effect. The unknown weights typically range between zero and one, indicating the contributions of individual exposures to the overall effect. Because the weight parameters are present only when the parameter for the overall effect is non-zero, testing hypotheses on the overall effect can be challenging, especially when the number of exposure variables is above two. This paper presents a working-model-based approach to estimate the parameter for the overall effect and to test specific hypotheses, including two tests for detecting the overall effect and one test for detecting unequal weights when the overall effect is evident. The statistics are computationally easy, and one can apply existing statistical software to perform the analysis. A simulation study shows that the proposed estimators for the parameters of interest may have better finite-sample performance than some other estimators.
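In symbols, the index model described above can be written as follows (a standard form of such index models, stated here for orientation; the sum-to-one normalization is a common convention):

\[
E(Y) \;=\; \beta_0 + \theta \sum_{j=1}^{p} w_j X_j, \qquad 0 \le w_j \le 1, \quad \sum_{j=1}^{p} w_j = 1,
\]

where \theta is the overall-effect parameter and w_j gives the contribution of the j-th standardized exposure. The weights are well defined only when \theta \neq 0, which is exactly what makes testing the overall effect nonstandard.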

Zhezhen Jin is Professor of Biostatistics in the Department of Biostatistics in the Mailman School of Public Health at Columbia University. He received his BS and MS in probability and statistics from Nankai University in 1989 and 1992 respectively, an MA in applied mathematics from the University of Southern California in 1994, and a Ph.D. in Statistics from Columbia University in 1998. After two years (1998-2000) of postdoctoral studies at the Harvard School of Public Health, he returned to Columbia as a faculty member in the Department of Biostatistics in 2000. He has been conducting statistical and biostatistical methodological research on resampling methods, survival analysis, nonparametric and semiparametric methods, smoothing methods, and statistical computing. He has also been collaborating with clinical investigators to address statistical issues in neurology, cardiology, oncology, transplantation, psychiatry, pathology, and alternative medicine. He was a co-founding editor of Contemporary Clinical Trials Communications. He is Statistical Editor for the Journal of the American College of Cardiology: Cardiovascular Imaging. He has served as an associate editor for several statistical journals including the Journal of the American Statistical Association, Statistica Sinica, Lifetime Data Analysis, Communications for Statistical Applications and Methods, and the Journal of Statistical Theory and Practice, and is on the editorial board of Kidney International, the journal of the International Society of Nephrology. He received a CAREER Award from the National Science Foundation in 2002. He is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics, and an elected member of the International Statistical Institute. He served as the President of the International Chinese Statistical Association (ICSA) in 2022.

Ju-Young Park

Fitting an Accelerated Failure Time Model with Time-dependent Covariates via Nonparametric Mixture

An accelerated failure time (AFT) model is a popular regression model in survival analysis. It models the relationship between the failure time and a set of covariates via a log link with the addition of a random error. The model can be either parametric or semiparametric depending on the degree of specification of the error distribution. The covariates are usually assumed to be fixed, i.e., 'time-independent'. In many biomedical studies, however, 'time-dependent' covariates are frequently observed. In this work, we consider a semiparametric time-dependent AFT model. We assume the distribution of the baseline failure time to be an infinite scale mixture of Gaussian densities; the model is thus highly flexible compared to one that assumes a one-component parametric density. We consider maximum likelihood estimation and propose an algorithm based on the constrained Newton method for estimating the model parameters and mixing distributions. The proposed methods are investigated via simulation studies to assess their finite-sample properties and are illustrated with a real data set.

I am a Ph.D. student majoring in Applied Statistics at Yonsei University in South Korea. I am conducting research on survival analysis under the guidance of my advisor, Prof. Sangwook Kang. My research focuses on survival models that take time-dependent covariates into account. Thank you for inviting me to this valuable opportunity.

Danyang Huang

Subsampling Spectral Clustering for Stochastic Block Models in Large-Scale Networks

The rapid development of science and technology has generated large amounts of network data, leading to significant computational challenges for network community detection. A novel subsampling spectral clustering algorithm is proposed to address this issue, which aims to identify community structures in large-scale networks with limited computing resources. The algorithm constructs a subnetwork by simple random subsampling from the entire network, and then extends the existing spectral clustering to the subnetwork to estimate the community labels for entire network nodes. As a result, for large-scale datasets, the method can be realized even using a personal computer. Moreover, the proposed method can be generalized in a parallel way. Theoretically, under the stochastic block model and its extension, the degree-corrected stochastic block model, the theoretical properties of the subsampling spectral clustering method are correspondingly established. Finally, to illustrate and evaluate the proposed method, a number of simulation studies and two real data analyses are conducted.
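As a minimal sketch of the algorithm's two stages (the paper's exact label-extension rule may differ; here the remaining nodes are projected into the subnetwork's spectral space and assigned to the nearest centroid):

    import numpy as np
    from sklearn.cluster import KMeans

    def subsampling_spectral_clustering(A, K, m, seed=0):
        # Stage 1: spectral clustering on a simple random subnetwork of size m
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        idx = rng.choice(n, size=m, replace=False)
        rest = np.setdiff1d(np.arange(n), idx)
        vals, vecs = np.linalg.eigh(A[np.ix_(idx, idx)])
        top = np.argsort(np.abs(vals))[::-1][:K]       # K leading eigenpairs
        U, lam = vecs[:, top], vals[top]               # assumes lam has no zeros
        km = KMeans(n_clusters=K, n_init=10).fit(U)
        labels = np.empty(n, dtype=int)
        labels[idx] = km.labels_
        # Stage 2: extend labels to the full network via the cross-adjacency projection
        V = A[np.ix_(rest, idx)] @ U / lam
        labels[rest] = km.predict(V)
        return labels

Only the m x m subnetwork is ever eigendecomposed, which is what makes the method feasible on a personal computer, and independent subsamples can be processed in parallel as the abstract notes.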

黄丹阳: Professor and doctoral advisor at the School of Statistics, Renmin University of China; principal investigator of an NSFC general project, a Beijing Social Science Foundation key project, and other provincial/ministerial-level grants; selected for the Beijing Association for Science and Technology Young Talent Support Program and previously funded by the Beijing outstanding talent program. Long-term research covers complex network modeling, ultra-high-dimensional data analysis, and distributed computing, as well as applications of statistical theory to credit risk assessment for small, medium, and micro enterprises and to enterprise digitalization. Author of more than 30 papers in leading journals including the Journal of the Royal Statistical Society: Series B (Statistical Methodology), the Journal of Econometrics, and the Journal of Business & Economic Statistics.

Haonan Wang

Recent developments for multi-channel factor analysis

As modern data collection techniques evolve, complex and inhomogeneous data are frequently collected from multiple sources with unobserved interference and idiosyncratic noise. Multi-channel factor analysis (MFA), introduced by Ramírez et al. (2020), allows for the extraction of low-dimensional latent factors that highlight the commonalities across various channels as well as identify unique structures within each channel. In this talk, we discuss some of the important properties of MFA, including identifiability and the asymptotic behavior of the quasi-Gaussian maximum likelihood estimators. Furthermore, we extend this framework to model time series data, incorporating both temporal and spatial dependencies.

Haonan Wang received his Ph.D. degree in statistics from the University of North Carolina at Chapel Hill in 2003. Currently, he is a Professor of Statistics at Colorado State University. His research interests are in object-oriented data analysis, functional dynamic modeling of neuron activities, spatial and spatio-temporal modeling, and statistical learning.

Jie Yang

Statistical Models for Categorical Data Analysis

Categorical responses, whose measurement scale consists of a set of categories, arise naturally in many different scientific disciplines. Multinomial logistic models have been widely used in the literature, which cover four kinds of logit models, baseline-category (also known as multiclass logistic regression model), cumulative, adjacent-categories, and continuation-ratio logit models. We propose a unified multinomial link model for analyzing categorical responses. It not only covers the existing multinomial logistic models and their extensions as special classes, but also allows the observations with NA or Unknown responses to be incorporated as a special category in the data analysis. We provide explicit formulae for computing the likelihood gradient and Fisher information matrix, as well as detailed algorithms for finding the maximum likelihood estimates of the model parameters. Our algorithms solve the infeasibility issue of existing statistics software on estimating parameters of cumulative link models. The applications to real datasets show that the proposed multinomial link models can fit the data significantly better, and the corresponding data analysis may correct the misleading conclusions due to missing data.

杨杰 (Jie Yang) is a professor in the Department of Mathematics, Statistics, and Computer Science at the University of Illinois at Chicago. He received a Ph.D. in Financial Mathematics from Nankai University in 2001 and a Ph.D. in Statistics from the University of Chicago in 2006. He has long taught and conducted research in statistics, financial mathematics, bioinformatics, and statistical analysis of big data; his results include fast classification methods for biological macromolecules, statistical classification methods for high-dimensional data, optimal experimental design theory and applications, real-time pricing methods for financial derivatives, and subsampling methods for big data.

Ping Ma

Statistical Computing Meets Quantum Computing

The recent breakthroughs in quantum computers have shown quantum advantage (aka quantum supremacy), i.e., quantum computers outperform classic computers for solving a specific problem. These problems are highly physics-oriented. A more relevant fact is that there are already general-purpose programmable quantum computing devices available to the public. A natural question for statisticians is whether these computers will benefit statisticians in solving some statistics or data science problems. If the answer is yes, what kind of statistics problems should statisticians resort to quantum computers? Unfortunately, the general answer to this question remains elusive.
In this talk, I will present challenges and opportunities for developing quantum algorithms. I will introduce a novel quantum algorithm for a statistical problem and demonstrate that the intersection of statistical computing and quantum computing is an exciting and promising research area. The development of quantum algorithms for statistical problems will not only advance the field of quantum computing but also provide new tools and insights for solving challenging statistical problems.

Professor Ma is a Distinguished Research Professor in the Department of Statistics at the University of Georgia and co-director of the big data analytics lab. He was a Beckman Fellow at the Center for Advanced Study at the University of Illinois at Urbana-Champaign, a Faculty Fellow at the US National Center for Supercomputing Applications, and a recipient of the National Science Foundation CAREER Award. His paper won the best paper award from the Canadian Journal of Statistics in 2011. He delivered the 2021 National Science Foundation Distinguished Lecture. Professor Ma serves on multiple editorial boards. He is a Fellow of the American Association for the Advancement of Science and the American Statistical Association.

Sangbum Choi

Interval-censored linear quantile regression

Censored quantile regression has emerged as a prominent alternative to classical Cox’s proportional hazards model or accelerated failure time model in both theoretical and applied statistics. While quantile regression has been extensively studied for right-censored survival data, methodologies for analyzing interval-censored data remain limited in the survival analysis literature. This paper introduces a novel local weighting approach for estimating linear censored quantile regression, specifically tailored to handle diverse forms of interval-censored survival data. The estimation equation and the corresponding convex objective function for the regression parameter can be constructed as a weighted average of quantile loss contributions at two interval endpoints. The weighting components are nonparametrically estimated using local kernel smoothing or ensemble machine learning techniques. To estimate the nonparametric distribution mass for interval-censored data, a modified EM algorithm for nonparametric maximum likelihood estimation is employed by introducing subject-specific latent Poisson variables. The proposed method’s empirical performance is demonstrated through extensive simulation studies and real data analyses of two HIV/AIDS datasets.
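Schematically, with check loss \rho_\tau(u) = u\{\tau - I(u<0)\} and an observed censoring interval [L_i, R_i] for the log failure time, an objective of the weighted-average form described above can be written as (an illustrative form consistent with the abstract, not the paper's exact display):

\[
\widehat{\beta}(\tau) \;=\; \arg\min_{\beta}\; \sum_{i=1}^{n} \Big\{ \widehat{w}_i\,\rho_\tau\big(\log L_i - X_i^{\top}\beta\big) \;+\; \big(1-\widehat{w}_i\big)\,\rho_\tau\big(\log R_i - X_i^{\top}\beta\big) \Big\},
\]

where the subject-specific weights \widehat{w}_i are estimated nonparametrically, e.g., by local kernel smoothing or ensemble machine learning, so the objective remains convex in \beta.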

Dr. Sangbum Choi received his Ph.D. in Statistics in 2010 from the University of Wisconsin-Madison. He was an assistant professor of Biostatistics at The University of Texas Health Science Center at Houston and is now a full professor of Statistics at Korea University. His research interests cover semiparametric methods in survival analysis, joint modeling, longitudinal data analysis, and actuarial data science.

Wu Wang

A Stock Price Trend Prediction Model Based on Supply Chain Matrix

This work explores the integration of industry-chain network matrices into graph neural network models to enhance the predictive ability of deep-learning factors for future stock returns. Historically, discretionary investors have predominantly used industry-chain analysis but have been constrained by data limitations, preventing its full utilization in quantitative investment. As natural language processing technology has matured, data providers can extract relationships between companies and products from annual reports and combine them with expert knowledge to construct upstream and downstream industry-chain relationships. On this foundation, we compute a matrix of interrelatedness between listed companies derived from the industry chain and introduce it into a graph neural network model as prior information. Experimental results demonstrate that the proposed model outperforms the baseline GRU model on the test set, with significantly higher IC means and lower IC standard deviations. This finding is consistent with existing research, while the differences in the stock pool and graph-structure information selected in this study contribute a supplement to the field. Additionally, the work extensively explores and explains the model structure, lookback periods, training labels, and other factors through numerous experiments.

王武 (Wu Wang) is a lecturer in the Department of Mathematical Statistics at Renmin University of China, with a Ph.D. in Mathematical Statistics from Fudan University. His research covers functional data analysis, spatial data analysis, and applications of machine learning and deep learning in the energy and industrial sectors. His work has appeared in Biometrics, the Scandinavian Journal of Statistics, and other journals.

Jie Li

Testing conditional quantile independence with functional covariate

We propose a new nonparametric conditional independence test for a scalar response and a functional covariate over a continuum of quantile levels. We build a Cramer–von Mises-type test statistic based on an empirical process indexed by random projections of the functional covariate, effectively avoiding the “curse of dimensionality” under the projected hypothesis which is almost surely equivalent to the null hypothesis. The asymptotic null distribution of the proposed test statistic is obtained under some mild assumptions. The asymptotic global and local power properties of our test statistic are then investigated. We specifically demonstrate that the statistic is able to detect a broad class of local alternatives converging to the null at the parametric rate. Additionally, we recommend a simple multiplier bootstrap approach for estimating the critical values. The finite-sample performance of our statistic is examined through a number of Monte Carlo simulation experiments. Finally, an analysis of an EEG data set is used to show the utility and versatility of our proposed test statistic.

李杰 (Jie Li) is a lecturer at the School of Statistics, Renmin University of China. He received his Ph.D. in Statistics from Tsinghua University in 2022. His research focuses on functional data analysis and time series analysis. He currently leads an NSFC Young Scientists project and a China Postdoctoral Science Foundation general project, and has published several papers in Biometrics, Statistica Sinica, and other journals.

Zerui Guo

Unified Principal Components Analysis of Irregularly Observed Functional Time Series

Irregularly observed functional time series (FTS) are increasingly available in many real-world applications. To analyze FTS, it is crucial to account for both serial dependence and the irregularly observed nature of functional data. However, existing methods for FTS often rely on specific model assumptions to capture serial dependence, or cannot handle irregular observational schemes. To address these issues, one can perform dimension reduction on FTS via functional principal component analysis (FPCA) or dynamic FPCA. Nonetheless, these two methods may be either theoretically suboptimal or too redundant to represent serially dependent functional data. In this article, we introduce a novel dimension-reduction method for FTS based on the framework of dynamic FPCA. Through a new concept called optimal functional filters, we unify the theories of FPCA and dynamic FPCA, providing a parsimonious and optimal representation for FTS that adapts to its serial dependence structure. We refer to this framework as principal analysis via dependency-adaptivity (PADA). Under a hierarchical Bayesian model, we establish an estimation procedure for dimension reduction via PADA. Our method can be used for both sparsely and densely observed FTS and is capable of predicting future functional data. We investigate the theoretical properties of PADA and demonstrate its effectiveness through extensive simulation studies. Finally, we illustrate our method via dimension reduction and prediction of daily PM2.5 data.

郭泽睿 is a Ph.D. student at the School of Mathematics, Sun Yat-sen University, working mainly on functional data analysis and epidemic modeling. His work has appeared in the European Journal of Epidemiology, the Chinese Journal of Preventive Medicine, and other journals at home and abroad.

Qin Shao

Forecasting Interval for Autoregressive Time Series with Trend

We propose a kernel distribution estimator (KDE) for the cumulative distribution function of an autoregressive time series with trend. We show that under certain assumptions this estimator is as efficient as an infeasible KDE that assumes the trend is known. The oracular KDE is used to estimate the quantiles on which a forecasting interval is constructed. Simulation studies confirm the asymptotic properties of the KDE. To illustrate the method, we apply it to monthly average hourly wage data.
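In generic form, such a kernel distribution estimator is (a textbook form, stated here for orientation): with detrended residuals \widehat{\varepsilon}_t and bandwidth h,

\[
\widehat{F}_h(x) \;=\; \frac{1}{n}\sum_{t=1}^{n} \mathcal{K}\!\Big(\frac{x - \widehat{\varepsilon}_t}{h}\Big), \qquad \mathcal{K}(v) = \int_{-\infty}^{v} K(u)\,du,
\]

where K is a kernel density; a level-(1-\alpha) forecasting interval is then built from the estimated quantiles \widehat{F}_h^{-1}(\alpha/2) and \widehat{F}_h^{-1}(1-\alpha/2) around the trend-plus-autoregression point forecast.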

Dr. Qin Shao obtained her bachelor's and master's degrees from Nankai University in 1990 and 1993, respectively. In 1997 she entered the doctoral program in Statistics at the University of Georgia. Upon graduating in 2002, she took up a tenure-track position as Assistant Professor of Statistics at the University of Toledo, achieving the rank of Professor in 2013. Her research interests encompass both the methodology and applications of statistics; one of her major interests has been semi-parametric time series modeling. In addition, she has always been interested in using statistics to address important issues in society.

Mengyu Xu

Inference for Quantile Change Points in High-Dimensional Time Series

Change-point detection methods that are based on quantiles can effectively detect changes in extreme values. In this study, we propose a novel change-point detection scheme that utilizes fixed quantiles of moving sums from high-dimensional time series data. Our approach employs a moving sum (MOSUM) test statistic that aggregates the component series by the $\ell^{\infty}$ norm. We investigate the asymptotic properties of the proposed test statistic in the context of weakly temporally dependent high-dimensional time series, while also allowing for strong and weak cross-sectional dependence. Our analysis relies on a powerful uniform Bahadur representation result. Specifically, we extend the existing uniform Bahadur representation to the high-dimensional setting for dependent data. A simulation study demonstrates the effectiveness of our approach.
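Schematically, a MOSUM statistic with $\ell^{\infty}$ aggregation takes the following form (a generic template, not the paper's exact statistic): for component series j, window width b, and transformed observations Z_{ij} (e.g., quantile indicators),

\[
M_j(k) \;=\; \frac{1}{\sqrt{2b}}\,\Big|\sum_{i=k+1}^{k+b} Z_{ij} \;-\; \sum_{i=k-b+1}^{k} Z_{ij}\Big|, \qquad
T_n \;=\; \max_{b \le k \le n-b}\;\max_{1 \le j \le p} M_j(k),
\]

so a change in some component's quantile level inflates the corresponding moving-sum contrast, and the maximum over components is what requires the high-dimensional Bahadur representation.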

Mengyu Xu received her Bachelor's degree in Statistics from Renmin University of China, Beijing, China in 2010. She received her M.S. and Ph.D. degrees from the Department of Statistics at the University of Chicago, Chicago, USA in 2012 and 2016. Her research interests include covariance matrix estimation and time-varying network recovery from high-dimensional time series, as well as the distribution theory of quadratic forms and high-dimensional hypothesis tests.

Feng Zhou

Accelerating Convergence in Bayesian Few-Shot Classification

Bayesian few-shot classification has been a focal point in the field of few-shot learning. This paper seamlessly integrates mirror descent-based variational inference into Gaussian process-based few-shot classification, addressing the challenge of non-conjugate inference. By leveraging non-Euclidean geometry, mirror descent achieves accelerated convergence by providing the steepest descent direction along the corresponding manifold. It also exhibits the parameterization invariance property concerning the variational distribution. Experimental results demonstrate competitive classification accuracy, improved uncertainty quantification, and faster convergence compared to baseline models. Additionally, we investigate the impact of hyperparameters and components.
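For orientation, the mirror-descent update underlying the approach can be written generically as follows: given a variational objective \mathcal{L}, a step size \eta_t, and a mirror map \psi with Bregman divergence B_\psi,

\[
\lambda^{(t+1)} \;=\; \arg\min_{\lambda}\; \eta_t \big\langle \nabla_\lambda \mathcal{L}(\lambda^{(t)}),\, \lambda \big\rangle \;+\; B_\psi\big(\lambda,\, \lambda^{(t)}\big),
\]

which reduces to plain gradient descent when \psi is the squared Euclidean norm; for exponential-family variational distributions, a suitable \psi yields natural-gradient-like steps along the corresponding manifold, which is the source of the acceleration and parameterization invariance mentioned above.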

周峰 (Feng Zhou) is a lecturer at the School of Statistics, Renmin University of China, and an RUC Outstanding Young Scholar. He leads an NSFC Young Scientists project, has received special and general funding from the China Postdoctoral Science Foundation, and was selected for the International Postdoctoral Exchange Program. His research interests include statistical machine learning, Bayesian methods, stochastic processes, and spatiotemporal data analysis. He has published more than 20 papers in journals and conferences including JMLR, MLJ, STCO, NeurIPS, ICLR, AAAI, and AISTATS.

Zhibo Cai

A Variable Selection Tree and Its Random Forest

A novel screening approach is proposed that partitions the sample sequentially into subsets, creating a tree-like structure of sub-samples called the SIS-tree. The SIS-tree is straightforward to implement and can be integrated with various measures of dependence. Theoretical results, including its "sure screening property", are established to support the approach. Additionally, the SIS-tree is extended to a forest with improved performance. Simulations demonstrate that the proposed methods substantially improve on existing SIS methods. The selection of a cutoff for the screening is also investigated through theoretical justification and experimental study. As a direct application of the screening, classification of high-dimensional data is considered, where the ranking and cutoff can substantially improve the performance of existing classifiers.
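A minimal sketch of the tree-structured screening idea (the paper's exact partitioning and aggregation rules may differ; absolute Pearson correlation stands in for the generic dependence measure):

    import numpy as np

    def sis_tree_ranking(X, y, depth=3, min_size=20, seed=0):
        # Score each feature on a tree of nested sub-samples, then rank by total score.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        scores = np.zeros(p)

        def visit(idx, level):
            if level > depth or len(idx) < min_size:
                return
            for j in range(p):                  # marginal dependence on this sub-sample
                scores[j] += abs(np.corrcoef(X[idx, j], y[idx])[0, 1])
            half = len(idx) // 2                # split the node into two children
            visit(idx[:half], level + 1)
            visit(idx[half:], level + 1)

        visit(rng.permutation(n), 1)
        return np.argsort(scores)[::-1]         # feature indices, strongest first

Screening keeps the top-ranked features up to a cutoff; repeating the procedure over many random permutations and averaging the scores gives a forest-style variant in the spirit of the extension described above.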

蔡智博 (Zhibo Cai) is a lecturer in the Department of Data Science and Big Data Statistics at the School of Statistics, Renmin University of China. His research interests include sufficient dimension reduction, variable selection and their applications in machine learning, and the theory and application of generative AI. His papers have appeared in journals and conferences including JASA, NeurIPS, and ICLR.

Xinyue Wang

U.S.-U.K. PETs Prize Challenge: Anomaly Detection via Privacy-Enhanced Federated Learning

Privacy Enhancing Technologies (PETs) have the potential to enable collaborative analytics without compromising privacy. This matters because collaborative analytics can let us extract real value from the large amounts of data collected in domains such as healthcare, finance, and national security, among others. To foster innovation and move PETs from research labs to actual deployment, the U.S. and U.K. governments partnered in 2021 to propose the PETs Prize Challenge, soliciting privacy-enhancing solutions for two of the biggest problems facing us today: financial crime prevention and pandemic response. This article presents the Rutgers ScarletPets privacy-preserving federated learning approach to identifying anomalous financial transactions in a payment network system (PNS). The approach uses a two-step anomaly-detection methodology. In the first step, features are mined based on account-level data and labels, and a privacy-preserving encoding scheme is used to augment these features to the data held by the PNS. In the second step, the PNS learns a highly accurate classifier from the augmented data. Our proposed approach has two major advantages: 1) there is no noteworthy drop in accuracy between the federated and the centralized setting, and 2) our approach is flexible, since the PNS can keep improving its model and features to build a better classifier without imposing any additional computational or privacy burden on the banks. Notably, our solution won the first prize in the U.S. for its privacy, utility, efficiency, and flexibility.

Xinyue Wang received her Ph.D. from Rutgers University in Newark, NJ, USA. Her research interests lie in the interdisciplinary areas of data privacy and security, deep learning, and their applications in various fields such as bioinformatics and finance.

Jiancheng Jiang

Partition-Insensitive Parallel ADMM Algorithm for High-dimensional Linear Models

The parallel alternating direction method of multipliers (ADMM) algorithms have gained popularity in statistics and machine learning due to their efficient handling of large-sample data problems. However, the parallel structure of these algorithms, based on the consensus problem, can lead to an excessive number of auxiliary variables when applied to high-dimensional data, resulting in a large computational burden. In this paper, we propose a partition-insensitive parallel framework based on the linearized ADMM (LADMM) algorithm and apply it to solve nonconvex penalized high-dimensional regression problems. Compared to existing parallel ADMM algorithms, our algorithm does not rely on the consensus problem, significantly reducing the number of variables that need to be updated at each iteration. It is worth noting that the solution of our algorithm remains largely unchanged regardless of how the total sample is divided, a property known as partition-insensitivity. Furthermore, under some mild assumptions, we prove the convergence of the iterative sequence generated by our parallel algorithm. Numerical experiments on synthetic and real datasets demonstrate the feasibility and validity of the proposed algorithm. We provide a publicly available R software package to facilitate the implementation of the proposed algorithm.
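
To fix ideas about the update structure that LADMM refines, here is the standard single-machine ADMM for the lasso in R, in its familiar textbook form; the talk's algorithm instead linearizes the quadratic subproblem, runs in parallel over sample partitions, and targets nonconvex penalties:

```r
# Standard ADMM for the lasso, shown only to fix ideas about the
# beta / z / dual update structure that the talk's LADMM builds on.
soft <- function(a, k) sign(a) * pmax(abs(a) - k, 0)

admm_lasso <- function(X, y, lambda, rho = 1, iters = 200) {
  p <- ncol(X)
  XtX <- crossprod(X); Xty <- crossprod(X, y)
  L <- chol(XtX + rho * diag(p))          # cache one factorization
  beta <- z <- u <- numeric(p)
  for (t in seq_len(iters)) {
    beta <- backsolve(L, forwardsolve(t(L), Xty + rho * (z - u)))
    z <- soft(beta + u, lambda / rho)     # proximal step for the l1 penalty
    u <- u + beta - z                     # scaled dual update
  }
  z
}

set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1:3] %*% c(3, -2, 1) + rnorm(100)
round(admm_lasso(X, y, lambda = 10), 2)   # first three entries nonzero
```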

Dr. Jiancheng Jiang is Professor of Statistics in the Department of Mathematics and Statistics & School of Data Science, University of North Carolina at Charlotte. His research interests include financial econometrics, theoretical and applied statistics, biostatistics, and data science.

Sangwook Kang

Deep Neural Network-based Accelerated Failure Time Models Using Rank Loss

An accelerated failure time (AFT) model assumes a log-linear relationship between failure times and a set of covariates. In contrast to other popular survival models that work on hazard functions, the effects of covariates act directly on failure times, making their interpretation intuitive. The semiparametric AFT model, which does not specify the error distribution, is sufficiently flexible and robust to departures from distributional assumptions. Owing to these desirable features, this class of models has been considered a promising alternative to the popular Cox model in the analysis of censored failure time data. However, in these AFT models a linear predictor for the mean is typically assumed, and little research has addressed non-linearity of predictors when modeling the mean. Deep neural networks (DNNs) have received much attention over the past few decades and have achieved remarkable success in a variety of fields. DNNs have a number of notable advantages and have been shown to be particularly useful in addressing non-linearity. Here, we propose applying a DNN to fit AFT models using a Gehan-type loss combined with a sub-sampling technique. Finite sample properties of the proposed DNN and rank-based AFT model (DeepR-AFT) were investigated via an extensive simulation study. The DeepR-AFT model showed superior performance over its parametric and semiparametric counterparts when the predictor was non-linear. For linear predictors, DeepR-AFT performed better when the dimension of the covariates was large. The superior performance of the proposed DeepR-AFT is demonstrated using three real datasets.
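
To make the Gehan-type loss concrete, here is a naive O(n^2) R implementation in our own notation (residuals e_i = log t_i - f(x_i), event indicators delta_i); the talk replaces the linear predictor with a DNN and sub-samples the pairs:

```r
# A naive O(n^2) Gehan-type rank loss for an AFT model; f(.) is a linear
# predictor here purely for illustration, where the talk uses a DNN.
gehan_loss <- function(logtime, delta, fx) {
  e <- logtime - fx                       # model residuals
  d <- outer(e, e, "-")                   # d[i, j] = e_i - e_j
  sum(delta * pmax(-d, 0)) / length(e)^2  # sum of delta_i * (e_i - e_j)^-
}

set.seed(3)
n <- 100
x <- rnorm(n)
logtime <- 1 + 2 * x + rnorm(n)           # true log-linear AFT
delta <- rbinom(n, 1, 0.8)                # indicator drawn independently,
                                          # purely for illustration
gehan_loss(logtime, delta, fx = 1 + 2 * x)   # near the truth: small loss
gehan_loss(logtime, delta, fx = rep(0, n))   # poor predictor: larger loss
```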

Sangwook Kang received a BS in Statistics from Seoul National University, South Korea (2001) and a PhD in Biostatistics from the University of North Carolina at Chapel Hill (2007). He was an Assistant Professor at the University of Georgia (2007-2010) and the University of Connecticut (2010-2013), and has been at Yonsei University, South Korea since 2013, where he has served as Assistant, Associate, and Full Professor.

成慧敏

Network Tight Community Detection

Conventional community detection methods often categorize all nodes into clusters. However, the presumed community structure of interest may only be valid for a subset of nodes (termed “tight nodes”), while the rest of the network may consist of noninformative “scattered nodes”. For example, a protein-protein network often contains proteins that do not belong to specific biological functional modules but are involved in more general processes or act as bridges between different functional modules. Forcing each of these proteins into a single cluster introduces unwanted biases and obscures the underlying biological implications. To address this issue, we propose a tight community detection (TCD) method to identify tight communities while excluding scattered nodes. The algorithm enjoys a strong theoretical guarantee of tight node identification accuracy and is scalable to large networks. The superiority of the proposed method is demonstrated by various synthetic and real experiments.

I am an Assistant Professor in the Department of Biostatistics at Boston University. I am affiliated with the Rafik B. Hariri Institute for Computing and Computational Science Engineering and Nanotechnology Innovation Center at Boston University. I received my Ph.D. in statistics from the University of Georgia in 2023.

赵博娟

Two variable screening procedures with restrictions on the positive or negative effects

In this paper, two variable screening procedures, the local significant forward and backward procedure with restrictions on the positive or negative effects (FBRPN) and the backward procedure with restrictions on the positive or negative effects (BRPN), are proposed to obtain meaningful protective and risk factors, in fast and sequential ways, in models with a linear component (such as GLMs) while avoiding multicollinearity. The two fitted models from the procedures are compared to obtain the most efficient model and the representative variables among the original predictors. The new procedures are compared with stepwise and best-subsets regression in three illustrative examples. Simulation studies provide insight into how different covariance structures affect the final fitted models obtained from the procedures. Cross-validation comparisons with the stepwise, LASSO, and LAR methods are made based on the Efron diabetes data. Finally, practical issues are discussed, and applications of the new procedures in big data analysis are envisioned.

赵博娟 received a Ph.D. in mathematical statistics from the Department of Mathematics at Nankai University, held postdoctoral positions at Southern Methodist University and the Harvard School of Public Health, and worked at Meharry Medical College in the United States. She is now a professor and doctoral supervisor at Tianjin University of Finance and Economics.

沈梓梁

Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery

This talk examines distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression, an alternative to least squares that is robust to outliers and data heterogeneity, has been widely applied. However, the non-smoothness of its check loss poses substantial challenges for distributed computing and theoretical analysis. To overcome these difficulties, we propose a transformation strategy that converts the quantile regression problem into a least-squares optimization problem. Using a double-smoothing technique, we extend previous Newton-type distributed methods and remove the stringent assumption of independence between the error term and the covariates.
We develop an efficient algorithm with excellent computational and communication efficiency. Theoretically, the proposed distributed estimator attains a near-optimal convergence rate after a certain number of iterations and achieves highly accurate support recovery.
In addition, extensive experiments on synthetic and real datasets validate the effectiveness of the proposed method: it delivers accurate estimates for high-dimensional data while effectively recovering the key support structure.
Overall, this talk offers a new perspective on, and a solution to, distributed estimation and support recovery in high-dimensional quantile regression, with both theoretical and practical value.
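
To make the non-smoothness issue concrete: quantile regression at level $\tau$ minimizes the check loss

$$\rho_\tau(u) = u\,\{\tau - \mathbf{1}(u < 0)\},$$

which has a kink at $u = 0$, so Newton-type methods do not apply directly. Smoothing approaches replace the indicator with a kernel-smoothed surrogate, e.g. $\mathbf{1}(u < 0) \approx \bar{K}(-u/h)$ for a smooth distribution-type kernel $\bar{K}$ and bandwidth $h$, yielding a differentiable objective; the double-smoothing construction in this talk builds on this idea (our loose paraphrase; the precise construction is in the underlying paper).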

I am a Ph.D. student in statistics at the School of Statistics and Management, Shanghai University of Finance and Economics, advised by Associate Professor 王绍立. I am passionate about statistics and machine learning theory, particularly distributed computing. I previously received my bachelor's degree from Nanchang University.

师佳鑫

Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects

Testing judicial impartiality is a problem of fundamental importance in empirical legal studies, for which standard regression methods have been popularly used to estimate the extralegal factor effects. However, those methods cannot handle control variables with ultrahigh dimensionality, such as those found in judgment documents recorded in text format. To solve this problem, we develop a novel mixture conditional regression (MCR) approach, assuming that the whole sample can be classified into a number of latent classes. Within each latent class, a standard linear regression model can be used to model the relationship between the response and a key feature vector, which is assumed to be of a fixed dimension. Meanwhile, the ultrahigh dimensional control variables are used to determine the latent class membership, where a naïve Bayes type model is used to describe the relationship. Hence, the dimension of the control variables is allowed to be arbitrarily high. A novel expectation-maximization algorithm is developed for model estimation, which allows us to estimate the key parameters of interest as efficiently as if the true class membership were known in advance. Simulation studies are presented to demonstrate the proposed MCR method. A real dataset of Chinese burglary offenses is analyzed for illustration purposes.

师佳鑫 is a Ph.D. student in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University. Research interests include latent structure analysis for high-dimensional data, factor models, computational law, and complex network data analysis, with a research paper accepted by the Annals of Applied Statistics.

余柏辰

A Gaussian Mixture Model for Multiple Instance Learning with Partially Subsampled Instances

Multiple instance learning is a powerful machine learning technique, useful when numerous instances can be naturally grouped into different bags. A bag-level label can then be created for each bag according to whether the instances contained in the bag are all negative or not. Thereafter, how to train a statistical model with bag-level labels, with or without partially labeled instances, becomes a problem of great interest. To this end, we develop a Gaussian mixture model (GMM) framework to describe the stochastic behavior of the instance-level feature vectors. Both the instance-based maximum likelihood estimator (IMLE) and the bag-based maximum likelihood estimator (BMLE) are theoretically investigated. We find that the statistical efficiency of the IMLE can be much better than that of the BMLE if the instance-level labels are relatively hard to predict. To address this, we develop a subsampling-based maximum likelihood estimation (SMLE) approach, where the instance-level labels are partially provided through careful subsampling. This leads to a significantly reduced labeling cost with little sacrifice in statistical efficiency. To demonstrate the finite sample performance, extensive simulation studies are presented, and a real data example using whole-slide images (WSIs) to diagnose metastatic breast cancer is illustrated.

余柏辰 is a Ph.D. student in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, advised by Professor 王汉生, and received a bachelor's degree from the School of Statistics at East China Normal University. Research interests include image data analysis and high-dimensional data analysis.

李雪曈

Gaussian Mixture Model with Rare Event

We study a Gaussian Mixture Model (GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits an extremely slow numerical convergence rate. To understand this phenomenon theoretically, we formulate the numerical convergence problem of the EM algorithm with rare events data as a problem about a contraction operator. Theoretical analysis reveals that the spectral radius of the contraction operator in this case can be arbitrarily close to 1 asymptotically, which explains the empirically slow numerical convergence of the EM algorithm with rare events data. To overcome this challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. Compared with the standard EM algorithm, the key feature of the MEM algorithm is that it requires additional labeled data. We find that the MEM algorithm significantly improves the numerical convergence rate compared with the standard EM algorithm. The finite sample performance of the proposed method is illustrated by both simulation studies and a real-world dataset of Swedish traffic signs.
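
The contraction-operator view can be stated compactly: writing one EM step as a fixed-point map $\theta^{(t+1)} = M(\theta^{(t)})$ with fixed point $\theta^*$, a Taylor expansion gives

$$\theta^{(t+1)} - \theta^* \approx J(\theta^*)\,(\theta^{(t)} - \theta^*),$$

where $J$ is the Jacobian of the EM map, so the local convergence rate is governed by the spectral radius $\rho(J(\theta^*))$. The abstract's finding is that with rare events $\rho$ can approach 1, and the extra labeled data used by MEM pulls it away from 1 (standard EM theory restated here in our notation).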

李雪曈 is a Ph.D. student in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, advised by Professor 王汉生. Research interests include imbalanced data analysis, network-structured data analysis, and distributed computing, with research papers published in Statistica Sinica and the Electronic Journal of Statistics.

李忻月

Functional Adaptive Double-Sparsity Estimator for High-Dimensional Sensor Data Analysis

Wearable sensors have been increasingly used in health monitoring and early anomaly detection. Wearable devices can collect objective and continuous information on physical activity and vital signs and have great potential for studying associations with health outcomes. However, effectively analyzing high-frequency multi-dimensional sensor data is challenging. In this talk, we propose a new Functional Adaptive Double-Sparsity Estimator (FadDoS) based on functional regularization of the sparse group lasso with multiple functional predictors, which can achieve global sparsity via functional variable selection and local sparsity via zero-subinterval identification within coefficient functions. We prove that the FadDoS estimator converges at a bounded rate and satisfies the oracle property under mild conditions. Extensive simulation studies confirm the theoretical properties and exhibit excellent performance compared to existing approaches. We applied FadDoS to a Kinect sensor study that used an advanced motion-sensing device to track multiple human joint movements among community-dwelling elderly adults, and we demonstrate how FadDoS can effectively characterize the detailed associations between joint movements and physical health assessments. The proposed method is not only effective for Kinect sensor analysis but also applicable to broader fields where multi-dimensional sensor signals are collected simultaneously. The R code for FadDoS is available at https://github.com/Cheng-0621/FadDoS.
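
For intuition, the scalar-covariate analogue of the double-sparsity penalty is the sparse group lasso,

$$\lambda_1 \sum_{g=1}^{G} \lVert \beta_g \rVert_2 + \lambda_2 \lVert \beta \rVert_1,$$

where the group-level $\ell_2$ term zeroes out whole groups (global sparsity) and the $\ell_1$ term zeroes individual coefficients (local sparsity). FadDoS works with the functional analogue, in which groups correspond to coefficient functions and local sparsity means identifying zero subintervals (our paraphrase of the abstract; the norms there are functional norms).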

Prof. Li received her PhD in Biostatistics from Yale University. Prior to Yale University, she spent one year at Peking University and three years at the University of Chicago, receiving her B.A. and M.S. in Statistics from the University of Chicago. Prof. Li’s research focuses on statistical methods for wearable device data, medical imaging data, large population studies, and precision medicine. Her research papers were published in high-impact journals, such as The Lancet, JAMA Network Open, Advanced Science, IEEE Internet of Things Journal, NPJ Digital Medicine, and Statistica Sinica. Prof. Li has established collaboration with China, Europe and US to join international efforts in developing statistical methods for analyzing wearable sensor data in large population health studies.

罗翔宇

Bayesian Integrative Region Segmentation in Spatially Resolved Transcriptomic Studies

The spatially resolved transcriptomic study is a recently developed biological experiment that can measure gene expressions and retain spatial information simultaneously, opening a new avenue to characterize fine-grained tissue structures. In this article, we propose a nonparametric Bayesian method named BINRES to carry out region segmentation for a tissue section by integrating all three types of data generated during the study: gene expressions, spatial coordinates, and the histology image. BINRES is able to capture more subtle regions than existing statistical partitioning models that only partially make use of the three data modes, and it is more interpretable than neural-network-based region segmentation approaches. Specifically, owing to a nonparametric spatial prior, BINRES does not require a prespecified region number and can learn it automatically. BINRES also combines the image and the gene expressions in the Bayesian consensus clustering framework and thus flexibly adjusts their label alignment contribution weights in a data-adaptive manner. A computationally scalable extension is developed for large-scale studies. Both simulation studies and a real application to three mouse spatial transcriptomic datasets demonstrate that BINRES outperforms competing methods and easily achieves uncertainty quantification of the integrative partition.

罗翔宇 joined the Institute of Statistics and Big Data at Renmin University of China in September 2018, where he is now a tenure-track Associate Professor. He received his Ph.D. from the Department of Statistics at the Chinese University of Hong Kong in 2018. His research interests include Bayesian statistics, nonparametric Bayes, bioinformatics, and statistical computing, and he is devoted to developing new statistical models to solve practical biological problems. His specific research directions include constructing gene regulatory or co-expression networks with statistical graphical models, correcting batch effects in high-throughput data, deconvolving bulk-level gene expression or DNA methylation data, discovering individual heterogeneity at single-cell resolution, and integrative analysis of spatial transcriptomics and multi-omics data.

孙韬

Enhancing Treatment Strategies and Risk Assessment in Hip Fracture Elderly Patients: A Copula-Based Approach for Semi-Competing Risks Analysis

Hip fracture is a severe complication in the elderly. The affected people are at a higher risk of second fracture and death occurrence, and the best treatment for hip fractures is still being debated. Aside from the treatment, many factors, such as comorbidity conditions, may be associated with second fracture and death occurrence. This study aims to identify effective treatments and important covariates and estimate their effects on the progression of second fracture and death occurrence in hip fracture elderly patients using the semi-competing risks framework, because death dependently censors a second fracture but not vice versa. Due to the complex semi-competing risks data, performing variable selection simultaneously for second fracture and death occurrence is difficult. We propose a penalised semi-parametric copula method for semi-competing risks data. Specifically, we use separate Cox semi-parametric models for both margins and employ a copula to model the two margins’ dependence. We apply the proposed method to a population-based cohort study of hip fracture elderly patients, providing new insights into their treatment and clinical management.
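
As a concrete instance of the margins-plus-copula construction (our illustration; the talk does not commit to this family), the joint survival function of the second-fracture time $T_1$ and the death time $T_2$ can be modeled as

$$P(T_1 > t_1, T_2 > t_2) = C_\theta\{S_1(t_1), S_2(t_2)\}, \qquad C_\theta(u, v) = \left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta},$$

with Cox-type semiparametric models for the margins $S_1$ and $S_2$ and, in the Clayton family shown here, a parameter $\theta > 0$ capturing the dependence between second fracture and death; the penalization for variable selection is then applied to both margins.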

孙韬 is an Associate Professor in the School of Statistics at Renmin University of China. He received his Ph.D. from the Department of Biostatistics at the University of Pittsburgh. His main research interests are models for complex survival data and risk management of disability in the elderly.

梅好

Network and Covariate Adjusted Response-Adaptive Design

Randomization is a distinguishing feature of clinical trials for unbiased assessment of treatment efficacy. With a growing demand for more flexible and efficient randomization schemes, and motivated by the idea of adaptive design, in this article we propose the network and covariate adjusted response-adaptive (NCARA) design, which can concurrently manage three challenges: 1) maximizing the benefits of a trial by randomly assigning more patients to the superior treatment group; 2) balancing social network ties across treatment arms to eliminate potential network interference; and 3) ensuring balance of important covariates, such as age, gender, and other potential confounders. We conduct simulations with different network structures and a variety of parameter settings. The NCARA design outperforms four alternative randomization designs in solving the above-mentioned problems and has comparable power and type I error for detecting true differences between treatment groups. In addition, we conduct real data analyses implementing the new design in two clinical trials. Compared to equal randomization (the original design utilized in the trials), the NCARA design slightly increases power, largely increases the percentage of patients assigned to the better-performing group, and significantly improves network and covariate balance. The advantages of the NCARA design are augmented when the sample size is small and the level of network interference is high. In summary, the proposed NCARA design assists researchers in conducting clinical trials with high quality and high efficiency.

梅好, Lecturer in the School of Statistics at Renmin University of China and RUC Distinguished Young Scholar, received a Ph.D. from Yale University in 2021 and previously worked at the clinical outcomes research center of Yale New Haven Hospital and in Tencent's healthcare division. Research interests include statistical methods for network data analysis, survival analysis, and complex data modeling, with applications in healthcare and decision prediction. More than ten papers have appeared in journals such as Biometrics, Statistics in Medicine, Annals of Emergency Medicine, and BMC Health Services Research, with over 300 citations in total.

段晓丽

What happens when your validated ecosystem is a Graph?

A validated environment for using R to develop clinical trial reporting tools and deliver reproducible data analytic results (i.e., table, listing, and figure outputs) for regulatory submission is a must. The Comprehensive R Archive Network (CRAN), which sets the highest standard for validating a new or upgraded package, assesses the cohort of a package's reverse dependencies upon submission and evaluates whether the package continues to serve as expected as a dependency in the current validated ecosystem. Indeed, evaluating the heaviness of package dependencies and the risk of inter-dependency impacting reproducibility is a complex process, given that active package up-versioning and data standards publications make our Auto-validation R Submission Portal a dynamic system/network on a daily basis.

Our goal is to streamline the comprehensive review of all available package dependencies in a validated ecosystem via a directed graph, a non-linear data structure from graph theory, and to simplify the validation task workflow in terms of computational complexity. We will
(1) linearly traverse/search and visualize package dependencies within a user-defined scope,
(2) linearly order/schedule pending packages in the validation queue and automatically trigger the next one to run in the validation pipelines, minimizing newly broken package behaviors due to package upgrades (a minimal sketch of this ordering step follows the list), and
(3) automatically notify package owners/maintainers when their package dependencies are upgraded to certain versions by other packages' requests, which requires the package to be re-submitted for validation (and also serves as a heads-up about potential test failures due to package up-versioning).
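
Here is a minimal sketch of the ordering step in R with igraph; the package names and edges below are made up for illustration, not the actual validated ecosystem:

```r
# Scheduling validation order from a dependency graph with igraph;
# the packages and edges here are illustrative only.
library(igraph)

# an edge A -> B means "A depends on B", so B must be validated first
deps <- data.frame(
  from = c("tern",    "tern",       "tidytlg", "forestly", "rtables"),
  to   = c("rtables", "formatters", "dplyr",   "gt",       "formatters")
)
g <- graph_from_data_frame(deps, directed = TRUE)

# topological order: topo_sort(mode = "out") puts dependents first,
# so we reverse it to validate dependencies before their dependents
rev(names(topo_sort(g, mode = "out")))

# packages affected when one dependency is upgraded: all reverse
# dependencies reachable against the edge direction
affected <- subcomponent(g, "formatters", mode = "in")
setdiff(names(affected), "formatters")
```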

Our demos will cover three CRAN-released clinical trial analysis tools: tern (Roche), tidytlg (J&J), and forestly (Merck). Note that our proposed framework can be generalized to any complex dataflow system, regardless of the tasks performed, programming languages, package managers, etc. The dynamic QC (for results) process (and data dependencies) can also be supported if we provide an end-to-end R solution to clinical reporting in a centralized platform.

Xiaoli Duan has been a Data Scientist in Roche PD Data Sciences since she received her Ph.D. degree in Industrial Engineering in 2022, with a research focus on statistical machine learning in healthcare. She is an R developer of the NEST project (chevron family) and a Python developer of automatic tumor segmentation algorithms. She is a product owner of the R interface to Roche’s distributed ecosystem across multiple semantic platforms.

程鼎

Integrating LLM Coding Capabilities in End-to-End Data Science: Challenges and Reflections

This presentation will explore the integration of large language model (LLM) coding capabilities within the end-to-end data science workflow. Using a case study of constructing a Chat Dashboard, we will delve into the challenges, insights, and reflections encountered throughout the process. The focus will be on development within the R programming environment, highlighting the application of statistical models to enhance data analysis and decision-making. The presentation will cover technical implementation details and share experiences in project management and interdisciplinary collaboration, providing practical guidance for professionals looking to leverage LLM advantages in the data science field.

Ding Cheng is currently working at AbbVie - Allergan Aesthetics, where he is responsible for commercial and business-related data analysis and modeling. With extensive experience in clinical research development, IT and innovation, and business intelligence, Ding is passionate about integrating advanced digital technologies with medical practices to drive improvements in the healthcare industry.

曹心怡

Patient Narrative Generation in R

The Patient Narrative, or Adverse Event narrative, is critical in clinical trials for providing detailed safety data. Its distinctive features include patient-specific content and presentation in chronological order. However, its creation involves tedious tasks like data retrieval and event timeline linking. Using R for the automated generation of patient narrative reports significantly reduces the resources spent on data collection and repetitive writing, offering a notable improvement in accuracy compared to manual methods. This presentation will primarily focus on how to generate narratives using R, along with the usage of currently popular R packages. Moreover, it will explore the potential for further automating narrative generation in R.
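
A toy flavor of the template-filling step is sketched below with the glue package; the column names and the one-line template are invented here and are far simpler than a real patient narrative:

```r
# Toy sketch of template-filling for narrative generation in R;
# columns and template are made up and much simpler than real narratives.
library(glue)

ae <- data.frame(
  usubjid = c("1001", "1002"),
  trt     = c("Drug A 10 mg", "Placebo"),
  aedecod = c("Headache", "Nausea"),
  aesev   = c("MILD", "MODERATE"),
  aestdtc = c("2024-03-01", "2024-04-15")
)

narr <- glue_data(
  ae,
  "Subject {usubjid}, treated with {trt}, experienced ",
  "{tolower(aesev)} {tolower(aedecod)} starting on {aestdtc}."
)
cat(narr, sep = "\n")
```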

Zoe Cao (曹心怡) is a statistical programmer at Simcere Zaiming Pharmaceutical Company. She graduated from the University of British Columbia with majors in Statistics and Economics.

王杰,刘晓畅

Joining Forces: Building Data Applications with R and Python

R and Python are indispensable tools for building data science applications, and each has its own strengths: R is known for its powerful statistical analysis and data visualization capabilities, while Python holds its ground in data processing and machine learning thanks to its readability and extensive library support. For a complete data science project, R and Python are not mutually exclusive; by combining their respective strengths, we can achieve synergy during development. Choosing tools flexibly based on requirements greatly speeds up application development and lends the application a degree of robustness. We will walk through this collaborative workflow in detail and show how to make the most of both R and Python, offering data scientists a fresh perspective on building data applications.

For most data cleaning, visualization, and front-end interaction needs, we choose R: R Shiny serves as the front-end framework, and golem as the framework for developing the Shiny application. For certain specialized features, we leverage Python's strengths by building endpoints with FastAPI that provide functionality to the front end. We also use Microsoft's Graph API to enrich the application's features.

In practice, the appropriate tools and frameworks should be chosen according to project requirements and the team's technical capabilities. Used well together, R and Python enable efficient, flexible, and powerful data applications, opening up more possibilities and room for innovation in data science work.
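
As a minimal illustration of the R-to-Python seam described above, the sketch below calls a FastAPI service from R with httr; the URL, route, payload, and response fields are invented for illustration, and any JSON-over-HTTP contract would look similar:

```r
# Sketch of an R front end calling a Python FastAPI service; the
# endpoint, payload, and response fields are hypothetical.
library(httr)

res <- POST(
  "http://localhost:8000/forecast",    # hypothetical FastAPI route
  body = list(series = c(112, 118, 132, 129), horizon = 4),
  encode = "json"
)
stop_for_status(res)                   # fail loudly on HTTP errors
out <- content(res, as = "parsed", type = "application/json")
str(out)                               # e.g. a hypothetical $prediction field
```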

王杰 is a data engineer on the technical solutions team of the Clinical Statistical Programming department at Johnson & Johnson Innovative Medicine China R&D. He is a skilled statistical programmer focused on identifying opportunities and driving optimization and innovation, applying both traditional and cutting-edge methods to clinical data analysis. He has 12 years of experience in biopharmaceutical data analysis; before joining Johnson & Johnson, he spent more than five years on clinical data analysis at Pfizer R&D.

刘晓畅 is a Data Engineer in the Clinical and Statistical Programming department at Johnson & Johnson. He is proficient in R, Python, and other programming languages and tools, with rich experience in large-scale clinical data processing, data mining, machine learning, data visualization, and generative AI applications. His goal is to support clinical research and decision-making with data-driven solutions. He received a bachelor's degree in pharmacy from Shandong University in 2018 and a master's degree in Drug Discovery and Translational Biology from the University of Edinburgh in 2019.

张春明

Simultaneous jump detection for multiple sequences via screening and multiple testing

The estimation of a nonparametric discontinuous regression function is fundamental in many applied fields, but challenges arise when the number of jumps (or discontinuities) is large and unknown. We propose a new jump detection method, via the consecutive screening and multiple testing (SaMT) algorithm, for estimating the unknown jump points in a flexible nonparametric regression model, guaranteeing the desired accuracy. The initial jump candidates are obtained in the consecutive screening procedure combined with a locally-linear smoothing method. To further assess the significance of an individual jump candidate, we develop a novel test based on profile likelihood inference. The ultimate selection of relevant jump points is conducted in a multiple testing procedure, which rules out irrelevant jump candidates with large variations due to heteroscedastic errors. Moreover, we generalize the proposed SaMT algorithm to detect common jump points shared across multiple aligned sequences. The proposed method is easy to implement, enjoys flexibility in the choices of bandwidth parameter and threshold quantity in screening, and is illustrated through simulations and real data examples, in comparison with existing methods.

Chunming Zhang is a Professor in the Department of Statistics at the University of Wisconsin-Madison. She earned her Ph.D. in Statistics from the University of North Carolina at Chapel Hill under the guidance of Jianqing Fan. She completed her B.S. in mathematical statistics at Nankai University, Tianjin, China, and an M.S. in Computational Mathematics from Academia Sinica, Beijing, China. Her research interests range from statistical learning and data mining, statistical methods with applications to imaging data, neuroinformatics, and bioinformatics, multiple testing, large-scale simultaneous inference and applications, statistical methods in financial econometrics, non- and semi-parametric estimation and inference, to functional and longitudinal data analysis. Her current research topics include new developments in the area of large-scale structure learning tasks and statistical inference procedures, with applications in neuroscience, biology, machine learning, and causal inference. She is an elected Fellow (2016) of the American Statistical Association (ASA) and an elected Fellow (2011) of the Institute of Mathematical Statistics (IMS) and is honored by a Medallion Award and Lecturer (2024) of the IMS.

马长兴

Common Odds Ratio Test and Interval Estimation for Stratified Bilateral and Unilateral Data

In clinical research, data are commonly collected bilaterally from paired organs or bodily parts within individual subjects. However, unilateral data arise when constraints or limiting factors impede the collection of complete bilateral data. In this paper, we propose three large-sample tests and five confidence interval methods for making inferences on the common treatment effect, measured by the odds ratio, in a stratified design under integrated bilateral and unilateral data. Our simulation results show that the likelihood ratio-based and score-based tests, along with their associated confidence interval methods, demonstrate robust control of type I error and close-to-nominal coverage probabilities. We apply the proposed methods to real-world datasets of acute otitis media and myopic eyes to showcase their validity and applicability in clinical practice.
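
For readers who want the classical baseline in R: base stats provides the Mantel-Haenszel common odds ratio test for stratified 2x2 tables (the integration of bilateral and unilateral data addressed in the talk is beyond it); the counts below are made up:

```r
# Classical common-odds-ratio baseline: Mantel-Haenszel test in base R.
# The talk's methods extend inference to integrated bilateral/unilateral
# data, which this classical test does not handle. Counts are invented.
tbl <- array(c(10, 5, 4, 12,    # stratum 1: 2x2 treatment-by-outcome
               8,  6, 5, 10),   # stratum 2
             dim = c(2, 2, 2))
mantelhaen.test(tbl)            # common OR estimate, CI, and test
```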

Changxing Ma, PhD, is Associate Professor and Co-Director of the Master of Public Health (MPH) Biostatistics program in the Department of Biostatistics at the University at Buffalo. He graduated from Nankai University in 1997. Before joining the University at Buffalo, he worked in the Department of Statistics at Nankai University from 1992 to 2002 and worked with longitudinal and birth cohort data at the University of Florida for five years, from 2000 to 2005. He has published more than 130 peer-reviewed papers in a wide range of statistical and biomedical journals. His Google Scholar h-index is 46, with an i10-index of 95.

刘笑

Assessing heterogeneous causal effects across clusters in partially nested designs

Partially nested designs are common in studies of psychological or behavioral interventions. In this type of design, after participants are assigned to study arms, participants in a treatment arm are subsequently assigned to clusters (e.g., teachers, therapy groups) to receive treatment, whereas participants in a control arm are unclustered (e.g., a wait-list control). As participants in the treatment arm receive treatment in clusters, it is often of interest to examine heterogeneity of treatment effects across the clusters; but this is challenging in the partially nested design. Particularly, in defining a causal effect of treatment for a specific cluster (e.g., a specific therapist), it is unclear how the treatment and control outcomes should be compared, as the control arm has no clustering (e.g., no therapists). It may be tempting to compare outcomes of a specific cluster to outcomes of the entire control arm; however, this comparison may not represent a causal effect even when the treatment assignment is randomized, because the cluster assignment in the treatment arm may be nonrandomized (elaborated in this talk). In this talk, I will describe our study that extends the principal stratification framework and the principal score approach to assessing heterogeneous cluster-specific treatment effects in the partially nested design. Besides the effect definition and identification, our study obtains various estimators for the cluster-specific treatment effects, including a multiply-robust estimator that can provide more robustness to parametric model misspecification. In addition to simulation results, I will present an empirical example applying our methods to estimating the heterogeneous treatment effects across clusters in a partially nested design. I will end this talk with a discussion of the implications of our study and potential future directions.

Xiao Liu is an assistant professor in the quantitative methods program of the Department of Educational Psychology at UT Austin. She is interested in causal inference methods, quasi-experimental methods (e.g., propensity score), causal mediation analysis, and longitudinal data analysis.

王春燕

Construction of strong orthogonal Latin hypercubes

Column-orthogonality and the space-filling property are perhaps the two most desirable design properties for computer experiments. Column-orthogonality allows the estimates of the main effects in linear models to be uncorrelated with each other, while the space-filling property is appropriate for Gaussian process models. Orthogonal Latin hypercubes are widely used for computer experiments; they achieve both orthogonality and the maximum one-dimensional stratification property. When two-factor (and higher-order) interactions are active, two- and three-dimensional stratifications are also important. Unfortunately, little is known about orthogonal Latin hypercubes with good two- (and higher-) dimensional stratification properties. This paper proposes a method for constructing a new class of orthogonal Latin hypercubes whose columns can be partitioned into groups, such that columns from different groups maintain two- and three-dimensional stratification properties. The proposed designs perform well under almost all popular criteria (e.g., the orthogonality, stratification, and maximin distance criteria) and are ideal designs for computer experiments. The construction method is straightforward to implement, and the relevant theoretical support is well established. The proposed strong orthogonal Latin hypercubes are tabulated for practical needs.
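
To see the criteria in action, one can generate a random Latin hypercube with the lhs package and inspect the column correlations; randomLHS makes no orthogonality or multi-dimensional stratification guarantee, which is exactly what the constructed designs improve on (the check below is our own illustration):

```r
# A random Latin hypercube via the lhs package, with a check of the
# column-orthogonality criterion; randomLHS guarantees neither
# orthogonality nor 2D/3D stratification, unlike the constructed designs.
library(lhs)

set.seed(4)
D <- randomLHS(n = 16, k = 4)      # 16 runs, 4 factors on (0, 1)
round(cor(D), 2)                   # off-diagonals near, but not exactly, 0

# one-dimensional stratification: each column hits every one of the
# 16 equal bins exactly once (the defining Latin hypercube property)
all(apply(floor(D * 16) + 1, 2, sort) == 1:16)
```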

王春燕, Lecturer in the School of Statistics at Renmin University of China, received a Ph.D. from Nankai University and has been a visiting scholar at the University of Tennessee and a postdoctoral research assistant at Purdue University. Research interests include statistical design of experiments, computer experiments, and order-of-addition experiments, with related papers published in journals such as Scientia Sinica Mathematica, the Annals of Statistics, and Statistica Sinica.