Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding (PDF download)


Posted: 2025-05-31 11:01 | Source: http://sh6999.cn | Author: reposted
Main content:
1. Introduction
Pretrained backbones with fine-tuning have been widely applied to various 2D vision and NLP tasks [132103], where a backbone network pretrained on a large dataset is concatenated with a task-specific back end and then fine-tuned for different downstream tasks. This approach demonstrates superior performance and great advantages in reducing the workload of network design and training, as well as the amount of labeled data required for different vision tasks.

* Interns at Microsoft Research Asia. † Contact person.
In this work, we present a pretrained 3D backbone, named SWIN3D, for 3D indoor scene understanding tasks. Our method represents the 3D point cloud of an input 3D scene as sparse voxels in 3D space and adapts the Swin Transformer [30], designed for regular 2D images, to unorganized 3D points as the 3D backbone. We analyze the key issues that prevent the naïve 3D extension of the Swin Transformer from exploring large models and achieving high performance, i.e., the high memory complexity and the ignorance of signal irregularity. Based on our analysis, we develop a novel 3D self-attention operator to compute the self-attention of sparse voxels within each local window, which reduces the memory cost of self-attention from quadratic to linear with respect to the number of sparse voxels within a window and computes efficiently, and enhances self-attention by capturing various signal irregularities via our generalized contextual relative positional embedding [4826].
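The window-restricted attention described above can be sketched in a few lines. This is a simplified NumPy illustration with hypothetical function names, assuming non-negative voxel coordinates; it uses plain per-window dot-product attention (still quadratic within each window) and omits the paper's memory-efficient operator and contextual positional embedding entirely:

```python
import numpy as np

def window_partition(coords, window_size):
    """Group sparse voxels by the 3D window each one falls into.
    coords: (N, 3) integer voxel coordinates (assumed non-negative here)."""
    win = coords // window_size                       # (N, 3) window indices
    # collapse each 3D window index into a single hashable key
    keys = win[:, 0] * 1_000_000 + win[:, 1] * 1_000 + win[:, 2]
    groups = {}
    for i, k in enumerate(keys):
        groups.setdefault(int(k), []).append(i)
    return list(groups.values())

def window_self_attention(feats, coords, window_size):
    """Self-attention restricted to voxels that share a local window.
    feats: (N, C) per-voxel features. Naive softmax attention per window."""
    out = np.zeros_like(feats)
    for idx in window_partition(coords, window_size):
        x = feats[idx]                                # (n, C) voxels in one window
        scores = x @ x.T / np.sqrt(x.shape[1])        # (n, n) similarity scores
        scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn = scores / scores.sum(axis=1, keepdims=True)  # row-wise softmax
        out[idx] = attn @ x                           # attention-weighted mix
    return out
```

A voxel alone in its window attends only to itself, so its feature passes through unchanged; voxels sharing a window exchange information.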
The novel design of our SWIN3D backbone enables us to
scale up the backbone model and the amount of data used
for pretraining. To this end, we pretrained a large SWIN3D
model with 60M parameters via a 3D semantic segmentation task over a synthetic 3D indoor scene dataset [60] that
includes 21K rooms and is about ten times larger than the
ScanNet dataset. After pretraining, we cascade the pretrained
SWIN3D backbone with task-specific back-end decoders
and fine-tune the models for various downstream 3D indoor
scene understanding tasks.
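The cascade-and-fine-tune workflow can be illustrated schematically. The class names, dimensions, and random weights below are hypothetical stand-ins for a pretrained encoder and a task-specific decoder, not the actual SWIN3D code:

```python
import numpy as np

rng = np.random.default_rng(0)

class Backbone:
    """Stand-in for a pretrained encoder (real weights would be loaded)."""
    def __init__(self, in_dim, feat_dim):
        self.W = rng.standard_normal((in_dim, feat_dim)) * 0.1
    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)            # per-voxel features

class SegHead:
    """Task-specific back-end decoder: per-voxel class logits."""
    def __init__(self, feat_dim, num_classes):
        self.W = rng.standard_normal((feat_dim, num_classes)) * 0.1
    def __call__(self, f):
        return f @ self.W

# cascade pretrained backbone with a downstream head, e.g. segmentation
backbone = Backbone(in_dim=6, feat_dim=32)            # e.g. xyz + rgb inputs
head = SegHead(feat_dim=32, num_classes=13)
points = rng.standard_normal((100, 6))
logits = head(backbone(points))                       # (100, 13) class scores
```

During fine-tuning, both the backbone and the head would be updated on the downstream task's labels; only the head is newly initialized.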
 


 
