Abstract
The training of temporal action localization (TAL) models relies heavily on large amounts of manually annotated data. Video annotation is more tedious and time-consuming than image annotation, so semi-supervised methods that combine labeled and unlabeled data for joint training have attracted increasing attention from academia and industry. This study proposes a method called pseudo-label refining (PLR), built on the teacher-student framework and consisting of three key components. First, we propose pseudo-label self-refinement, which features a temporal region-of-interest pooling to improve the boundary accuracy of TAL pseudo labels. Second, we design a module named boundary synthesis that further refines the temporal intervals of pseudo labels through multiple inferences. Finally, an adaptive weight learning strategy is tailored for progressively learning pseudo labels of varying quality. The proposed method uses ActionFormer and BMN as detectors and achieves significant improvement on the THUMOS14 and ActivityNet v1.3 datasets. The experimental results show that the proposed method significantly improves localization accuracy compared with other advanced SSTAL methods at label rates from 10% to 60%. Further ablation experiments show the effectiveness of each module, proving that the PLR method improves the accuracy of pseudo labels obtained by teacher model inference.
Citation: Meng L, Ban G, Xi G, Guo S (2025) Pseudo label refining for semi-supervised temporal action localization. PLoS ONE 20(2): e0318418. https://doi.org/10.1371/journal.pone.0318418
Editor: Dang N. H. Thanh, University of Economics Ho Chi Minh City, VIET NAM
Received: August 14, 2024; Accepted: January 15, 2025; Published: February 5, 2025
Copyright: © 2025 Meng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in the study are available from github repository OpenTAD: https://github.com/sming256/OpenTAD.
Funding: This research was funded by Guizhou Power Grid Co. Ltd, grant number GZKJXM20222320.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Temporal Action Localization (TAL) is an important task in the field of video understanding. The goal of this task is to identify and locate the categories of all actions that appear in a video, as well as the corresponding start and end times of the actions. TAL technology has wide application value in video editing, sports game analysis, surveillance video processing, and other fields. In recent years, TAL technology has developed rapidly, and a number of high-performance localization models such as MTSN [1], TALLFormer [2], ContextLoc++ [3], and ActionFormer [4] have emerged.
However, training these models relies on a large amount of manually annotated data. Labeling videos is complex, time-consuming, and expensive [5]. Therefore, in practical application scenarios, fully labeling all video data is not economically feasible. This has prompted the academic and industrial communities to pay more attention to applying semi-supervised learning (SSL) methods in the video field. Semi-supervised learning is a widely used training paradigm in machine learning and deep learning in which the training data comprise both labeled and unlabeled samples. Its purpose is to fully exploit the information in unlabeled data while training on labeled data.
Most existing semi-supervised learning methods follow the paradigm of self-training and consistency regularization [6–8]. Specifically, self-training methods are achieved by generating pseudo labels and using them to retrain the model, while consistency regularization is achieved by applying different levels of data augmentation and requiring consistent predictions. Data augmentation methods include masking, scaling, adding Gaussian noise, etc. Work in the field of semi-supervised temporal action localization [9, 10] also basically follows this paradigm.
However, these methods mainly focus on generating more accurate pseudo labels through more reliable feature representations, without paying attention to improving the quality of the pseudo labels themselves. The noise within pseudo labels directly affects the effectiveness of semi-supervised learning, as shown in Fig 1. There have been some attempts to alleviate the impact of pseudo-label noise through other means. NPL [5] attempts to aggregate classification and regression information to improve regression accuracy. SPOT [10] proposes a single-stage model that conducts classification and localization simultaneously to generate pseudo labels. Their experimental results prove that integrating information from the classification and regression branches can effectively improve the performance of semi-supervised learning.
Pseudo-label noise consists of two parts: classification error and boundary offset.
However, these methods still do not directly optimize the generated pseudo labels. Therefore, we propose a framework to improve the temporal localization accuracy of pseudo labels, called Pseudo-Label Refining (PLR), which consists of three parts: Pseudo-Label Self-Refinement (PLSR), Boundary Synthesis (BS), and Adaptive Weight Learning (AWL).
Firstly, PLSR is a module designed specifically for temporal action localization tasks. It refines initial pseudo labels through multiple rounds of self-tuning and boundary synthesis. By extracting features from unlabeled video segments, applying linear interpolation and 1D convolutional layers, PLSR produces a deeper representation. This representation is then fed into a classifier and regressor to obtain refined action boundaries and classification confidence scores. PLSR effectively adapts the concept of bounding box refinement from 2-stage object detection to SS-TAL, enabling fine-tuning of action intervals.
Secondly, BS aims to mitigate the negative effects of inaccurate pseudo labels by leveraging information from surrounding non-maximum predictions. It achieves this through three steps: randomly disturbing existing action intervals, adding Gaussian noise to the corresponding features, and averaging the results of multiple predictions weighted by their confidence scores. This mechanism ensures that the final refined action boundaries incorporate valuable information from multiple perspectives, improving the robustness and accuracy of temporal action localization.
Finally, AWL tackles the issue of varying learning difficulties among unlabeled samples and the unreliability of pseudo labels generated by the teacher model. It assigns dynamic weights to pseudo labels based on their credibility, as measured by the averaged information entropy. During training, the model prioritizes learning from high-confidence labels while gradually incorporating information from lower-confidence ones. This adaptive approach enhances the overall performance and robustness of the temporal action localization model by allowing it to focus on more reliable data while still benefiting from the full range of unlabeled data.
The proposed method is tested on the THUMOS14 [11] and ActivityNet v1.3 [12] datasets, and the experimental results show that the method trained with PLR outperforms the previous state-of-the-art methods by up to 3.5% mAP under four different label rates: 10%, 20%, 40%, and 60%.
Our contributions can be divided into three points:
- We propose the Pseudo-Label Self-Refinement and Boundary Synthesis modules tailored for temporal action localization. PLSR adapts bounding box refinement to the temporal domain for fine-tuning action intervals, while BS mitigates inaccuracies by leveraging multiple inferences, improving pseudo-label accuracy.
- We propose Adaptive Weight Learning to boost performance. AWL dynamically weights pseudo labels based on credibility, prioritizing high-confidence labels and gradually incorporating lower-confidence ones.
- Extensive experiments conducted on two major benchmarks, THUMOS14 [11] and ActivityNet v1.3 [12], demonstrate that PLR effectively utilizes unlabeled data to enhance temporal action localization performance, outperforming prior state-of-the-art methods.
Related work
Temporal action localization
Temporal Action Localization (TAL) methods [14] are typically categorized into three types: two-stage, single-stage, and anchor-free methods. Two-stage methods [13, 15–19] initially generate candidate action proposals for each instance and then perform boundary regression and action recognition on each proposal. These proposals can be generated using fixed anchors of single or multiple sizes [20–23] or through direct boundary regression [24, 25]. Although two-stage frameworks often incorporate intricate fusion techniques, they are prone to complicated design and error propagation from the proposal stage. Single-stage methods [26–29], on the other hand, directly predict temporal boundaries and action categories for each instance. Anchor-free methods [4, 30–33] perform action boundary regression without relying on default anchors. Due to their simple yet efficient frameworks, recent anchor-free models such as ActionFormer [4], TriDet [31], and MTSN [1] have achieved remarkable performance in the TAL task. At present, related detection technology has been widely applied in industrial scenarios [34, 35].
Semi-supervised learning
Semi-Supervised Learning (SSL) methods [36] can generally be categorized into four types: generative methods [37, 38], graph-based methods [39, 40], consistency regularization methods [8, 41–45], and pseudo-labeling methods [46–49]. SSL finds extensive applications in visual tasks, including image classification, object detection, and semantic segmentation. Many SSL methods for these tasks utilize pseudo-label techniques [50–53] and the consistency learning framework [43, 44]. The Mean Teacher approach [8] averages model weights using Exponential Moving Average (EMA) to enhance performance without altering the network architecture. Active Teacher [54] evaluates unlabeled samples based on three key criteria, maximizing the utilization of limited label information. PseCo [55] introduces multi-view scale-invariant learning for object detection. Additionally, Zhang et al. [56] and PCL [57] aim to refine pseudo labels for more reliable training.
Semi-supervised temporal action localization
Existing Semi-Supervised Temporal Action Localization (SS-TAL) methods can be classified into two main branches. The first branch centers on consistency learning and self-training [9, 58, 59]. KFC [58] demands that a single model make consistent predictions for features with added spatial and temporal perturbations. Ji et al. [59] subsequently integrate the teacher-student framework with SS-TAL, introducing time warping and time masking strategies. Furthermore, SSTAP [9] introduces a self-supervised module that learns relation-aware features through feature reconstruction and clip-order prediction training. The second branch of SS-TAL explores post-processing methods [5, 10]. SPOT [10] proposes a single-stage TAL solution to mitigate error propagation between action proposal generation and classification. NPL [5] combines class scores with boundary ambiguity and limits the maximum number of predictions per video, offering a different perspective on SS-TAL. These research efforts have achieved favorable results in experiments. However, methods utilizing pseudo labels have not adequately focused on the pseudo labels themselves, which contain valuable video knowledge. Therefore, we propose our SS-TAL framework, PLR, which emphasizes addressing the noise within pseudo labels. Our research on utilizing and improving pseudo labels aims to fill this gap in the field of SS-TAL.
Methods
Preliminary
Problem definition.
The task of SS-TAL is to improve the performance of a TAL network by fully utilizing both labeled videos Dl = {(xl, yl)} and unlabeled videos Du = {xu}. Each action segment in a labeled video is annotated with yl = ((s, e), c), where s, e, and c refer to the start time, end time, and category of the action instance. The performance of a TAL network is reflected in the accuracy of its temporal localization (ŝ, ê) and classification ĉ.
Video encoding.
To compare fairly with other SS-TAL methods, we use the same pre-trained video encoders as other methods on each detector to align the data input. More specifically, we use two-stream I3D [60] and TSN [61] to extract video features, consistent with NPL [5] and SPOT [10]. The pre-extracted features of a single video are represented as x. The video encoders are not updated during training.
Teacher-student framework.
The method presented is rooted in the Teacher-student framework [8]. This framework is characterized by two key aspects that synergistically enhance the learning process, particularly in semi-supervised learning scenarios.
Firstly, the teacher model is constructed as an Exponential Moving Average (EMA)-ensembled version of the student model. Mathematically, this can be expressed as:
θteacher(t) = α·θteacher(t − 1) + (1 − α)·θstudent(t)  (1)

where θ represents the model parameters, t denotes the time step corresponding to each update of the student weights, and α is the EMA decay rate. The teacher model is kept in inference mode. Given the superior performance of the ensembled model, the teacher is leveraged to generate pseudo labels ŷu from the features of unlabeled samples xu, which are then used to supervise the training of the student model. This process can be formulated as:
ℒ = ℒ(fstudent(xl), yl) + ℒ(fstudent(xu), ŷu)  (2)

where xl and yl represent the features of labeled samples and their corresponding labels, ŷu denotes the pseudo labels generated by the teacher, and ℒ denotes the loss function. Specifically, we use an IoU-based loss for localization and focal loss [62] for classification.
Secondly, a distinctive feature of this framework is the differential data exposure between the teacher and student models. The teacher model exclusively uses unlabeled data xu without noise during inference, while the student model is trained on data that includes noise, denoted as xu + ψ. This setup allows the student model to learn features that are robust to noise through the optimization constraints of consistent prediction. Specifically, the student model is encouraged to make predictions that are consistent with those of the teacher model, even when faced with noisy inputs. This robustness is then communicated back to the teacher model through the EMA updating mechanism, ensuring that the teacher also benefits from this noise-resistant learning.
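To make these two mechanisms concrete, the sketch below shows the EMA update of Eq (1) and the clean-teacher/noisy-student split in PyTorch; the stand-in network, noise scale, and consistency loss are illustrative assumptions rather than the exact configuration used here.

```python
import copy
import torch
import torch.nn as nn

# Stand-in detector: a single Conv1d layer over (batch, channels, time) features.
student = nn.Sequential(nn.Conv1d(2048, 256, kernel_size=3, padding=1), nn.ReLU())
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is inference-only, updated via EMA

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    # Eq (1): theta_teacher(t) = alpha * theta_teacher(t-1) + (1 - alpha) * theta_student(t)
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1 - alpha)

x_u = torch.randn(2, 2048, 128)                      # unlabeled clip features
with torch.no_grad():
    target = teacher(x_u)                            # teacher sees the clean input
pred = student(x_u + 0.01 * torch.randn_like(x_u))   # student sees x_u + psi
loss = nn.functional.mse_loss(pred, target)          # consistency-style objective
loss.backward()
ema_update(teacher, student)
```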
Overview
We design a semi-supervised temporal action localization method based on the teacher-student framework. The method consists of three parts: pseudo-label self-refinement (PLSR), boundary synthesis (BS), and adaptive weight learning (AWL). Before training starts, the original videos are input to the video encoder to obtain video features. In the pre-training stage, the labeled video features are used to pre-train the initial model and the PLSR module. In the semi-supervised training stage, the initial model is used to initialize the student model and the teacher model. The unlabeled video features are input into the teacher model, which generates the initial pseudo labels. PLSR and BS are then applied to refine the pseudo labels, which are used to train the student model. At the same time, the student model is also trained on labeled data. When updating weights according to the loss, different weights are assigned based on the confidence of the pseudo labels. The overall process is shown in Fig 2.
In the feature extraction stage, labeled and unlabeled video features are obtained from the original video. In the pre-training stage, labeled features are used to pre-train the initial model. In the semi-supervised training stage, the initial model initializes the student and teacher models. The teacher model outputs initial pseudo-labels using unlabeled features, which are refined through self-refinement and boundary synthesis for subsequent student model training. The student model also trains on labeled data. Weight updates consider pseudo-label confidence.
Pseudo-label self-refinement
Motivation.
In object detection, Fast R-CNN [15] is a seminal two-stage detector comprising a feature extraction module, a region-of-interest (ROI) pooling layer, a classification module, and a boundary regression module; its successor additionally introduces a region proposal network (RPN) to generate proposals. The boundary regressor plays the role of refining the proposal boxes:
b = fr(ROIPool(z, p))  (3)

where z denotes the feature map of the entire image, p refers to a generated proposal, and b is the predicted bounding box. This process extracts features from the proposed region using the convolutional layers of the network, then applies fully connected layers to predict the coordinate offsets of the bounding box. Since the regressor itself fine-tunes proposals, it can obviously also fine-tune bounding boxes: simply feed the predicted bounding box b back as a proposal into the latter half of the network:
brefined = fr(ROIPool(z, b))  (4)

where brefined denotes the refined bounding box, and the feature map z is reused.
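As a hedged illustration of Eqs (3) and (4), the sketch below feeds a predicted box back through the same ROI pooling and regression head; the feature-map shape, head, and offset parameterization are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 32, 32)                 # z: feature map of the entire image
proposal = torch.tensor([[0., 4., 4., 20., 20.]])  # (batch_idx, x1, y1, x2, y2) from the proposal stage
reg_head = nn.Linear(256 * 7 * 7, 4)               # hypothetical regressor predicting box offsets

def regress(z, boxes):
    pooled = roi_pool(z, boxes, output_size=(7, 7), spatial_scale=1.0)
    offsets = reg_head(pooled.flatten(1))
    return boxes[:, 1:] + offsets                  # Eq (3): b = p + f_r(ROIPool(z, p))

b = regress(feat, proposal)                        # first-pass prediction
b_refined = regress(feat, torch.cat([proposal[:, :1], b], dim=1))  # Eq (4): reuse z, feed b back
```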
In temporal action localization, the predicted action boundary is one-dimensional, so the original 2D regression structure is no longer applicable. To solve this problem, we propose a pseudo-label self-refinement (PLSR) module tailored for 1D temporal boundaries.
Implementation details of PLSR.
In temporal action localization, the pre-extracted features of a single video can be represented as x, which is processed by the backbone of the detector to generate a latent representation z ∈ R^(C×T), where C represents the feature dimension and T represents the temporal dimension. Initially, the latent representation zs:e corresponding to the temporal interval I = (s, e) is selected. Then, linear interpolation is applied along the temporal dimension to resize it into a fixed-size representation z′ ∈ R^(C×T′). In this method, T′ is set according to the distribution of action instance durations.
Due to the difference in tasks, the ROI pooling layer applied to two-dimensional feature maps is replaced with a one-dimensional convolutional layer. After this step, a deeper representation is obtained:

z″ = Conv1D(z′)  (5)
Input z″ into the classifier fc and regressor fr of the original detector, and feed fc(z″) and fr(z″) into the two corresponding fully connected layers to obtain the class-wise prediction confidence score ĉ ∈ R^(Ncls) and the interval offset (Δs, Δe), where Ncls represents the number of action types in the action set C. This process can be represented as follows:

ĉ = FCcls(fc(z″)),  (Δs, Δe) = FCreg(fr(z″))  (6)
Further, take the maximum value of ĉ as the classification confidence, and add (Δs, Δe) to the original interval I to obtain the fine-tuned temporal boundary:

Irefined = (s + Δs, e + Δe)  (7)

Now we define the refined pseudo label as ŷ = (Irefined, ĉ). The above process is illustrated in Fig 3.
The re-scaled and pooled representation z″ is fed into the classifier and regressor to obtain prediction confidence and interval offsets, which are processed through fully connected layers and used to obtain the fine-tuned temporal boundary.
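A minimal sketch of the PLSR forward pass (Eqs (5)–(7)) follows; the dimensions, the Conv1D kernel size, and the two head layers (fc_cls, fc_reg) are hypothetical stand-ins for the detector's real classifier and regressor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, T, T_prime, N_cls = 256, 128, 16, 20
conv1d = nn.Conv1d(C, C, kernel_size=3, padding=1)  # 1D replacement for ROI pooling
fc_cls = nn.Linear(C * T_prime, N_cls)              # class-wise confidence head
fc_reg = nn.Linear(C * T_prime, 2)                  # (delta_s, delta_e) offset head

def plsr(z, interval):
    s, e = interval
    z_se = z[:, :, s:e]                                              # select z_{s:e}
    z_p = F.interpolate(z_se, size=T_prime, mode="linear", align_corners=False)
    z_pp = conv1d(z_p)                                               # Eq (5)
    flat = z_pp.flatten(1)
    c_hat = fc_cls(flat).softmax(dim=-1)                             # Eq (6): class scores
    ds, de = fc_reg(flat)[0]                                         # Eq (6): offsets
    return (s + ds.item(), e + de.item()), c_hat                     # Eq (7): refined interval

z = torch.randn(1, C, T)                 # latent representation from the backbone
refined_interval, c_hat = plsr(z, (20, 60))
confidence = c_hat.max().item()          # maximum class score as classification confidence
```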
Boundary synthesis
Motivation.
The latent representation of a single video mentioned above can be represented as a tensor z ∈ R^(C×T). Based on this, the classifier and regressor of the detector perform T dense predictions, covering both action category and action interval. Then, based on the confidence scores of the predicted categories, Non-Maximum Suppression (NMS) or SoftNMS is used as post-processing to obtain the final prediction. This means that the prediction with the highest score (the maximum prediction) makes the other predictions near it invisible; however, due to the inaccuracy of pseudo labels, some inaccurate predictions are given high confidence.
To mitigate the negative impact of this problem, it is necessary to make full use of the information from the context of maximum prediction. Based on pseudo-label self-refinement, a boundary synthesis mechanism based on multi-vote weight summation is proposed.
Implementation details of BS.
The process is divided into three steps, as shown in Fig 4.
It improves accuracy of pseudo labels by leveraging information from the context of maximum prediction through three steps: random boundary disturbance, adding Gaussian noise to corresponding features, and averaging multiple predictions weighted by their confidence scores.
The first step is to randomly perturb the existing action interval to obtain a new action interval, and then expand it by 0.5 times its length to obtain the expanded interval Ĩ*:

Ĩ = (s + δs, e + δe),  δs, δe ∼ N(0, σ1(e − s))  (8)

where σ1 is a hyperparameter to be set. This step is repeated K times to obtain {Ĩk}, k = 1, …, K.
The second step is to select the features corresponding to these expanded intervals and add Gaussian noise to them:

z̃k = z(Ĩk*) + ε,  ε ∼ N(0, σ2)  (9)

where σ2 is a hyperparameter to be set.
The third step is to infer refined boundaries from the features of these noisy intervals:

(Îk, ĉk) = PLSR(z̃k, Ĩk)  (10)

where k ∈ [1, K].
Finally, the inference results are averaged, weighted by their confidence scores:

Î = (Σk max(ĉk)·Îk) / (Σk max(ĉk))  (11)

Note that the class-wise prediction confidence scores ĉk are retained for the following part.
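The sketch below strings the three BS steps together on top of the plsr function from the previous sketch; the interval clamping and the omission of the 0.5× expansion step are simplifications, not the exact procedure.

```python
import torch

def boundary_synthesis(z, interval, K=4, sigma1=0.05, sigma2=0.01):
    s, e = interval
    length = e - s
    votes = []
    for _ in range(K):
        # Step 1 (Eq 8): randomly disturb the interval boundaries
        s_k = int(round(s + torch.randn(1).item() * sigma1 * length))
        e_k = int(round(e + torch.randn(1).item() * sigma1 * length))
        s_k = max(0, min(s_k, z.shape[-1] - 2))
        e_k = max(s_k + 2, min(e_k, z.shape[-1]))
        # Step 2 (Eq 9): add Gaussian noise to the features
        z_noisy = z + sigma2 * torch.randn_like(z)
        # Step 3 (Eq 10): re-infer boundaries from the noisy features
        (s_hat, e_hat), c_hat = plsr(z_noisy, (s_k, e_k))
        votes.append((s_hat, e_hat, c_hat.max().item()))
    # Eq (11): confidence-weighted average of the K votes
    total = sum(c for _, _, c in votes)
    s_final = sum(s_ * c for s_, _, c in votes) / total
    e_final = sum(e_ * c for _, e_, c in votes) / total
    return (s_final, e_final)
```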
Adaptive weight learning
Motivation.
The learning difficulty of different unlabeled samples varies significantly, with some being easier to learn from than others. Additionally, the unbalanced distribution of unlabeled samples further complicates the issue, as it results in pseudo labels generated by the teacher model being of unequal reliability.
To address this phenomenon and ensure that the learning process is not unduly influenced by unreliable pseudo labels, a dynamic weight is designed that comprehensively considers the credibility of each pseudo label. This approach allows the model to adaptively adjust the importance of each pseudo label based on its reliability, thereby improving the overall performance and robustness of the TAL model.
Implementation details of AWL.
Here, we use mean information entropy (mIE) to measure the pseudo label of a given action instance:

mIE = −(1/K) Σk Σc ĉk(c)·log ĉk(c)  (12)

where K is the same as in BS, and C denotes the number of classes. When the classification output is more evenly distributed, the mIE is higher and the corresponding pseudo label is less reliable. As a simple example, when the classification output is a one-hot vector, the mIE is 0; in all other cases, the mIE is greater than 0.
Our approach is to initially focus on learning from high-confidence pseudo labels and gradually increase the weight of low-confidence pseudo labels. The loss weight for each epoch is calculated by:

wunsupervised = 1 + (2·epoch/epochmax − 1)·mIE  (13)

where epoch represents the current training round and epochmax represents the maximum number of training rounds. wunsupervised is used as a factor multiplied by the loss calculated from pseudo labels. At the initial stage of training, the coefficient in front of mIE is −1, meaning that pseudo labels with higher mIE (i.e., lower confidence) have smaller weights, and conversely, pseudo labels with higher confidence have larger weights. As training approaches its end, the coefficient in front of mIE becomes 1, so pseudo labels with higher mIE receive larger weights. This way, we prioritize learning from more reliable labels while still incorporating information from less reliable ones, ultimately improving the overall performance and robustness of the model.
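A small sketch of AWL follows, implementing the mIE of Eq (12) and the epoch-dependent weight as reconstructed in Eq (13); how the class scores are normalized before the entropy is an assumption.

```python
import torch

def mean_information_entropy(class_scores):
    # class_scores: (K, C) softmax outputs from the K boundary-synthesis votes
    p = class_scores.clamp_min(1e-12)
    return (-(p * p.log()).sum(dim=-1)).mean().item()   # Eq (12)

def unsupervised_weight(mie, epoch, epoch_max):
    coef = 2.0 * epoch / epoch_max - 1.0   # runs from -1 (start) to +1 (end) of training
    return 1.0 + coef * mie                # Eq (13), as reconstructed above

scores = torch.softmax(torch.randn(4, 20), dim=-1)      # K = 4 votes, C = 20 classes
w = unsupervised_weight(mean_information_entropy(scores), epoch=5, epoch_max=15)
```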
Training process
In this section, we describe the training strategy we devised. To provide a more intuitive explanation, we present the training pseudocode in Algorithm 1. The first stage trains a model and PLSR using only labeled data, and initializes both the teacher model and the student model with this trained model. In the second stage, at the beginning of each epoch, the teacher model combined with PLSR and BS generates pseudo labels for all unlabeled data. During this process, the parameters of the PLSR component continue to be trained with labeled data, while the teacher model is not updated directly. The pseudo labels are then used to train the student model, whose weights are propagated back to the teacher model via Exponential Moving Average (EMA).
Algorithm 1 Training process of the proposed PLR
Require: Training dataset D = {(xl, xu, yl)} and hyperparameters: α, K, σ1, σ2
1: Initialize model parameters θmodel: (θbackbone, θhead), θPLSR.
▹ Stage 1: Supervised Pre-training
2: for each training iteration do
3:  Sample a batch B = {(xl, yl)} from D
4:  zl ← fbackbone(xl)
5:  ŷl ← fhead(zl)
6:  ŷl′ ← fPLSR(zl, ŷl)
7:  ℒdet ← ℒ(ŷl, yl)
8:  ℒPLSR ← ℒ(ŷl′, yl)
9:  UpdateParameters(θmodel, ℒdet)
10:  UpdateParameters(θPLSR, ℒPLSR)
11: end for
▹ Stage 2: Semi-Supervised Training
12: θstudent(train), θteacher(eval) ← θmodel
13: for each training iteration do
14:  Sample a batch B = {(xl, xu, yl)} from D
▹ Supervised Part
15:  zl ← fstudent_backbone(xl)
16:  ŷl ← fstudent_head(zl)
17:  ŷl′ ← fPLSR(zl, ŷl)
18:  ℒsup ← ℒ(ŷl, yl)
19:  ℒPLSR ← ℒ(ŷl′, yl)
▹ Unsupervised Part
20:  zu ← fteacher_backbone(xu)
21:  (I, ĉ) ← fteacher_head(zu)  ▹ initial pseudo labels
22:  (I, ĉ) ← fPLSR(zu, I)  ▹ first self-refinement pass, Eqs (5)–(7)
23:  (I, ĉ) ← fPLSR(zu, I)  ▹ second self-refinement pass
24:  Filter pseudo labels by classification confidence threshold
25:  for k from 1 to K do
26:   Ĩk ← Disturb(I, σ1); z̃k ← z(Ĩk*) + N(0, σ2)  ▹ Eqs (8), (9)
27:   (Îk, ĉk) ← fPLSR(z̃k, Ĩk)  ▹ Eq (10)
28:  end for
29:  ŷu ← WeightedAverage({(Îk, ĉk)})  ▹ Eq (11)
30:  ℒunsup ← wunsupervised · ℒ(fstudent(xu + ψ), ŷu)  ▹ wunsupervised from Eq (13)
▹ Update Parameters
31:  UpdateParameters(θstudent, ℒsup + ℒunsup)
32:  UpdateParameters(θPLSR, ℒPLSR)
33:  θteacher ← αθteacher + (1 − α)θstudent
34: end for
Results and discussion
Dataset
We evaluate our method on two TAL benchmark datasets, THUMOS14 [11] and ActivityNet v1.3 [12]. THUMOS14 [11] provides 1010 validation videos, of which 200 are temporally annotated, and 1574 test videos, of which 213 are temporally annotated, covering 20 action categories. Following common practice, the proposed network is trained on the validation set and evaluated on the test set. ActivityNet v1.3 [12] is divided into training, validation, and test subsets, which contain 10024, 4926, and 5044 videos from 200 categories, respectively. Following the standard evaluation protocol, the network is trained on the training subset, and evaluation results are reported on the validation subset.
For the semi-supervised task, we randomly select 10%, 20%, 40%, and 60% of the samples from each action category as labeled samples, and the remaining samples are treated as unlabeled. Our experiments are conducted at these four label rates.
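For illustration, a per-category split at a given label rate might look like the sketch below; the annotations mapping and the seed handling are hypothetical.

```python
import random

def split_by_label_rate(annotations, rate, seed=0):
    # annotations: {category: iterable of video ids} (hypothetical structure)
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for category, videos in annotations.items():
        videos = list(videos)
        rng.shuffle(videos)
        n = max(1, round(rate * len(videos)))   # keep at least one labeled sample per class
        labeled.extend(videos[:n])
        unlabeled.extend(videos[n:])            # the rest is treated as unlabeled
    return labeled, unlabeled

labeled, unlabeled = split_by_label_rate({"golf": range(10), "diving": range(8)}, rate=0.1)
```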
Implementation details
We use ActionFormer [4] and BMN [17] as the basic TAL model for our method, following NPL [5].
When employing ActionFormer as the base detector, in the pre-training stage we use only labeled data to pretrain for 40/15 epochs (for THUMOS14 and ActivityNet v1.3, respectively); the learning rate is set to 1e-4/1e-3 with 10/5 warmup epochs and 30/10 epochs of cosine annealing, weight decay is set to 0.05, and the EMA α is set to 0.995 for the 10% label rate and 0.999 for the 60% label rate. In the semi-supervised stage, we utilize both labeled and unlabeled data for training and set the EMA α to 0.999. The remaining training settings are identical to those used in the pre-training stage.
When using BMN as the base detector, in the pre-training stage we use labeled data to pretrain for 30/30 epochs with a learning rate of 1e-3/4e-3 and a weight decay of 5e-3/5e-3. We then improve the model with pseudo labels for 15 epochs with the EMA α set to 0.99. The temporal dimension of BMN is set to 100/256 for ActivityNet/THUMOS14, respectively.
For the key hyperparameters, K is set to 4, σ1 to 0.05, and σ2 to 0.01. For data augmentation, we apply a 10% time mask and Gaussian-sampled temporal scaling. All pseudo labels are filtered with a class-confidence threshold of 0.2. SoftNMS is employed for post-processing, and experiments are conducted on a single RTX 2080 Ti with a fixed random seed.
Comparison with other methods
We compare the proposed method PLR with existing main SS-TAL methods in Table 1.
The table compares the performance of our proposed PLR method with several state-of-the-art approaches on the THUMOS14 and ActivityNet v1.3 datasets. Results are reported in terms of average precision (AP) and mean average precision (mAP). For THUMOS14 [11], the IoU thresholds are [0.3:0.7:0.1]; for ActivityNet v1.3, the IoU thresholds are [0.5:0.95:0.05], and we report three sets of AP and mAP following common practice.
On THUMOS14 [11], our PLR method, when combined with ActionFormer [4] and BMN [17], achieves the highest mAP of 61.2% and 42.9% at 60% label rate, outperforming other methods such as MixUp [63], NPL [5], and SPOT [10]. Similarly, at other label rates, PLR surpasses other SOTA methods as well.
On ActivityNet v1.3 [12], the PLR method consistently improves the performance of the base models, ActionFormer and BMN, across different IoU thresholds and label rates.
Overall, the results demonstrate the effectiveness of our PLR method in enhancing the performance of existing TAL models, achieving state-of-the-art results on both THUMOS14 and ActivityNet v1.3 datasets.
Ablation and discussion
To test the effectiveness of each module in our proposed method, we design the following ablation experiments where ActionFormer [4] works as action detector with I3D [60] features.
Effectiveness of individual modules.
We started by training with only vanilla pseudo labels and gradually added each module to observe its effect. Experiments are conducted at label rates of 10% and 60%, with results shown in Table 2. Adding pseudo-label self-refinement (PLSR) alone increases the mAP by 3.9% and 3.2%. Applying boundary synthesis (BS) increases the mAP by 4.0% and 3.7%. With adaptive weight learning adopted on top of PLSR and BS, the mAP reaches 22.3% and 61.2%.
Effectiveness of pseudo-label self-refinement.
The PLSR module is specifically designed to refine the temporal boundaries of the pseudo labels, aiming to enhance the precision of these labels in terms of their temporal alignment. To thoroughly evaluate its performance, we examine it from an Intersection over Union (IoU) perspective. This metric allows us to quantify the degree of overlap between the predicted temporal boundaries and the ground-truth boundaries. Consequently, the effectiveness of the PLSR module’s boundary correction can be accurately reflected by the mean IoU score calculated between the predictions and the ground-truths. A higher mean IoU indicates better boundary refinement, demonstrating the module’s capability to improve the accuracy of pseudo labels in temporal localization tasks.
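For reference, the temporal IoU used here reduces to a one-dimensional overlap ratio, as in this minimal sketch.

```python
def temporal_iou(pred, gt):
    # pred, gt: (start, end) intervals in seconds or frames
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

# Example: intervals (1, 5) and (3, 7) overlap by 2 out of a union of 6.
assert abs(temporal_iou((1.0, 5.0), (3.0, 7.0)) - 2.0 / 6.0) < 1e-9
```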
As shown in Table 3, models trained with refined pseudo labels exhibit better localization ability than models trained with vanilla pseudo labels, demonstrating the effectiveness of PLSR. At a label rate of 10%, the average IoU of the fine-tuned pseudo labels improved by 0.05 over the unrefined ones, which also increased the mAP by 3.9%. The same held at a label rate of 60%, where the average IoU of the pseudo labels improved by 0.08, resulting in a 2.8% increase in accuracy.
Effectiveness of boundary synthesis.
The role of boundary synthesis (BS) is to reduce the inherent bias of the model through multiple inferences, similar in spirit to applying test-time augmentation (TTA) to TAL at inference time. BS consists of three main components: random boundary disturbance, adding Gaussian noise during inference, and boundary aggregation. We separately tested the effectiveness of random boundary disturbance and Gaussian noise, with results reported in Table 4.
It should be noted that "+ Gaussian noise" denotes using the exact same feature interval. Gaussian noise plays the key role in the multiple inferences, allowing the model to reduce its expected bias, while random boundary disturbance further strengthens this effect.
Impact of random seeds.
We select 10 random seeds for a fixed set of settings to observe the impact of random seeds. We find that random seeds can cause the final mAP to fluctuate by up to 0.5%. However, our overall improvement is significantly higher than previous SOTA, thus indirectly reflecting the effectiveness of the PLR method. The averaged performance curve is shown in Fig 5.
Inference speed
Although our proposed method significantly improves the quality of pseudo labels, it requires more training time than conventional semi-supervised methods. Specifically, with K = 4 as in this paper, it requires six additional PLSR inference passes compared with the most basic pseudo-labeling approach. We address this issue from two angles: optimizing the code workflow to increase computational parallelism, and selecting the most cost-effective value of K.

We replaced the K sequential inference passes with a single pass whose batch size is increased by a factor of K. By parallelizing the K PLSR iterations in this way, the extra inference overhead is greatly reduced. According to our tests, processing one batch of samples through ActionFormer takes approximately 450 ms, including pseudo-label generation, while a PLSR pass takes about 110 ms. With six additional sequential PLSR passes, the inference time for a single batch rises to 1120 ms, a 148% increase over the original time. After parallelizing the K PLSR iterations, the inference time for a single batch drops to 810 ms, an 80% increase over the vanilla pseudo-label method (68 percentage points less than the sequential version).

We also conducted a search over the parameter K; results for ActionFormer on THUMOS14 at a 40% label rate are shown in Fig 6. We kept PLSR and AWL fixed while varying K. K = 0 means BS is not used, resulting in two PLSR forward passes; K = N (N > 0) results in two PLSR inferences plus one large-batch parallelized PLSR forward pass. The results show that when K ≤ 4, the additional computational overhead is relatively small and the performance improvement is significant; when K > 4, there is no notable performance gain but additional overhead is introduced. Based on these results, we set K to 4.
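The parallelization amounts to folding the K votes into the batch dimension so that a single forward pass replaces K sequential ones; the sketch below shows the reshape pattern with illustrative shapes, reusing the conv1d stand-in from the PLSR sketch.

```python
import torch

K, B, C, T_prime = 4, 8, 256, 16
z_votes = torch.randn(K, B, C, T_prime)        # K noisy copies of each sample's features
z_flat = z_votes.reshape(K * B, C, T_prime)    # fold the K votes into the batch dimension
out = conv1d(z_flat)                           # one large-batch forward instead of K passes
out = out.reshape(K, B, C, T_prime)            # unfold to aggregate the K votes per sample
```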
Additional training time, taking THUMOS14 + ActionFormer as an example.
In practical applications, a model trained by our method introduces no additional inference time, adhering to the original inference paradigm of the model. In other words, the extra time consumption of our method is incurred only during teacher-student training. Considering training time alone, the additional overhead of our method is entirely acceptable, as shown in Table 5. Taking THUMOS14 as an example, extracting features from all videos alone requires over 100 hours (thanks to previous researchers, since practical video data does not come with pre-extracted features), while our overall training time increases from 60 minutes with standard pseudo labels to approximately 100 minutes with our method. In terms of total training cost, the additional overhead of our method is less than 1%, yet it delivers significant performance improvements.
Conclusion
This study proposes a pseudo-label refining method based on the semi-supervised teacher-student framework. It comprises pseudo-label self-refinement, boundary synthesis, and adaptive weight learning. Using ActionFormer [4] and BMN [17] as detectors, the method achieves significant improvement on the THUMOS14 [11] and ActivityNet v1.3 [12] datasets, outperforming other SSTAL methods in mAP at label rates of 10% to 60%. Ablation experiments demonstrate the effectiveness of each module in enhancing pseudo-label accuracy. The proposed method improves pseudo-label accuracy by introducing new modules at the cost of extra computation. We believe future research directions for SS-TAL can be explored in the following ways:
- Better module design to provide a fine-grained mechanism for pseudo-label refinement; this has been preliminarily studied in SPOT [10];
- In-depth investigation of the training process, as there is currently a lack of comprehensive discussion on training procedures under the same benchmark;
- More reliable evaluation metrics for pseudo-labels. Currently, in SS-TAL, NPL [5] offers a pseudo-label evaluation metric that incorporates boundary confidence, but we believe there is still room for improvement.
We hope that our research can provide some inspiring ideas for the development of this field.
Acknowledgments
The authors would like to express gratitude to the editor and reviewers for their thorough review of the paper and the valuable suggestions they provide to enhance the overall quality of the manuscript.
References
- 1.
Jin X, Zhang T. MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization. In: Proceedings of the 31st ACM International Conference on Multimedia; 2023. p. 2573–2581.
- 2.
Cheng F, Bertasius G. TALLFormer: Temporal Action Localization with a Long-Memory Transformer. In: European Conference on Computer Vision (ECCV) 2022; 2022. p. 503–521.
- 3. Zhu Z, Wang L, Tang W, Zheng N, Hua G. ContextLoc++: A Unified Context Model for Temporal Action Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023;45(8):9504–9519. pmid:37021919
- 4.
Zhang CL, Wu J, Li Y. ActionFormer: Localizing Moments of Actions with Transformers. In: European Conference on Computer Vision (ECCV) 2022; 2022.
- 5.
Xia K, Wang L, Zhou S, Hua G, Tang W. Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 10126–10135.
- 6.
Zhou Q, Yu C, Wang Z, Qian Q, Li H. Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 4079–4088.
- 7.
Liu YC, Ma CY, He Z, Kuo CW, Chen K, Zhang P, et al. Unbiased Teacher for Semi-Supervised Object Detection. In: Proceedings of the International Conference on Learning Representations (ICLR); 2021.
- 8.
Tarvainen A, Valpola H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 1195–1204.
- 9.
Wang X, Zhang S, Qing Z, Shao Y, Gao C, Sang N. Self-Supervised Learning for Semi-Supervised Temporal Action Proposal. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 1905–1914.
- 10.
Nag S, Zhu X, Song YZ, Xiang T. Semi-supervised Temporal Action Detection with Proposal-Free Masking. In: European Conference on Computer Vision (ECCV) 2022; 2022. p. 663–680.
- 11.
Jiang Y, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, et al. THUMOS challenge: Action recognition with a large number of classes; 2014.
- 12.
Heilbron FC, Escorcia V, Ghanem B, Niebles JC. ActivityNet: A large-scale video benchmark for human activity understanding. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015. p. 961–970.
- 13.
Li J, Liu X, Zong Z, Zhao W, Zhang M, Song J. Graph Attention Based Proposal 3D ConvNets for Action Detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; 2020. p. 4626–4633.
- 14. Wang B, Zhao Y, Yang L, Long T, Li X. Temporal Action Localization in the Deep Learning Era: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024;46(1):2171–2190. pmid:37930912
- 15.
Girshick R. Fast R-CNN. In: International Conference on Computer Vision (ICCV); 2015.
- 16.
Kang H, Kim H, An J, Cho M, Kim SJ. Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023. p. 6514–6523.
- 17.
Lin T, Liu X, Li X, Ding E, Wen S. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019. p. 3888–3897.
- 18. Xu L, Wang X, Liu W, Feng B. Cascaded Boundary Network for High-Quality Temporal Action Proposal Generation. IEEE Transactions on Circuits and Systems for Video Technology. 2020;30(10):3702–3713.
- 19.
Zhu Z, Tang W, Wang L, Zheng N, Hua G. Enriching Local and Global Contexts for Temporal Action Localization. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021; p. 13496–13505.
- 20. Chen G, Zhang C, Zou Y. AFNet: Temporal Locality-Aware Network With Dual Structure for Accurate and Fast Action Detection. IEEE Transactions on Multimedia. 2021;23:2672–2682.
- 21.
Qing Z, Su H, Gan W, Wang D, Wu W, Wang X, et al. Temporal Context Aggregation Network for Temporal Action Proposal Refinement. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021.
- 22.
Xu H, Das A, Saenko K. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In: Proceedings of the International Conference on Computer Vision (ICCV); 2017.
- 23. Xu H, Das A, Saenko K. Two-Stream Region Convolutional 3D Network for Temporal Activity Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019;41(10):2319–2332. pmid:31180838
- 24.
Ning R, Zhang C, Zou Y. SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2021. p. 2460–2464.
- 25.
Wang Q, Zhang Y, Zheng Y, Pan P. RCL: Recurrent Continuous Localization for Temporal Action Detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 13556–13565.
- 26.
Chao YW, Vijayanarasimhan S, Seybold B, Ross DA, Deng J, Sukthankar R. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018. p. 1130–1139.
- 27.
Gao J, Yang Z, Sun C, Chen K, Nevatia R. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In: 2017 IEEE International Conference on Computer Vision (ICCV); 2017. p. 3648–3656.
- 28.
Huang Y, Dai Q, Lu Y. Decoupling Localization and Classification in Single Shot Temporal Action Detection. 2019 IEEE International Conference on Multimedia and Expo (ICME). 2019; p. 1288–1293.
- 29.
Long F, Yao T, Qiu Z, Tian X, Luo J, Mei T. Gaussian Temporal Awareness Networks for Action Localization. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019. p. 344–353.
- 30. Liu X, Wang Q, Hu Y, Tang X, Zhang S, Bai S, et al. End-to-End Temporal Action Detection With Transformer. IEEE Transactions on Image Processing. 2022;31:5427–5441. pmid:35947570
- 31.
Shi D, Zhong Y, Cao Q, Ma L, Li J, Tao D. TriDet: Temporal Action Detection with Relative Boundary Modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023. p. 18857–18866.
- 32.
Shi D, Zhong Y, Cao Q, Zhang J, Ma L, Li J, et al. ReAct: Temporal Action Detection with Relational Queries. In: European Conference on Computer Vision (ECCV) 2022; 2022.
- 33.
Zhao C, Liu S, Mangalam K, Ghanem B. Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022; p. 10637–10647.
- 34. Zhu H, Li H, Liu Y, Qian X, Liu Y, Gao M. Tower Related Object Detection Method Based on Improved YOLOv5s Network. Power Systems and Big Data. 2023;26(5):62–72.
- 35. Yu Y, Dong Z, Li J, Zhao B, Yang K, Guo D, et al. Algorithm for Recognizing the Status of Plates and Indicators Lights Based on Machine Vision Technology. Power Systems and Big Data. 2024;27(6):81–92.
- 36. Yang X, Song Z, King I, Xu Z. A Survey on Deep Semi-Supervised Learning. IEEE Transactions on Knowledge and Data Engineering. 2023;35(9):8934–8954.
- 37.
Kingma DP, Welling M. Auto-Encoding Variational Bayes. In: International Conference on Learning Representations (ICLR); 2013.
- 38.
Krichen M. Generative Adversarial Networks. In: 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT); 2023. p. 1–7.
- 39.
Wang D, Cui P, Zhu W. Structural Deep Network Embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.
- 40. Zhu W, Wang X, Cui P. Deep Learning for Learning Graph Representations. Deep Learning: Concepts and Architectures. 2019; p. 169–210.
- 41.
Ke Z, Wang D, Yan Q, Ren JSJ, Lau RWH. Dual Student: Breaking the Limits of the Teacher in Semi-supervised Learning. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019; p. 6727–6735.
- 42.
Laine S, Aila T. Temporal Ensembling for Semi-Supervised Learning. In: Proceedings of the International Conference on Learning Representations; 2016.
- 43.
Sajjadi MSM, Javanmardi M, Tasdizen T. Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems; 2016. p. 1171–1179.
- 44.
Zhang C, Yang T, Weng J, Cao M, Wang J, Zou Y. Unsupervised Pre-training for Temporal Action Localization Tasks. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 14011–14021.
- 45.
Xie Q, Luong MT, Hovy E, Le QV. Self-Training With Noisy Student Improves ImageNet Classification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020. p. 10684–10695.
- 46.
Blum A, Mitchell TM. Combining labeled and unlabeled data with co-training. In: COLT’ 98; 1998.
- 47.
Chen D, Wang W, Gao W, Zhou ZH. Tri-net for Semi-Supervised Deep Learning. In: International Joint Conference on Artificial Intelligence (IJCAI); 2018.
- 48.
Lee DH. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks; 2013.
- 49. Zhou ZH, Li M. Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering. 2005;17(11):1529–1541.
- 50.
Rizve MN, Duarte K, Rawat YS, Shah M. In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning. In: International Conference on Learning Representations (ICLR); 2021.
- 51.
Mukherjee S, Hassan Awadallah A. Uncertainty-aware Self-training for Few-shot Text Classification. In: Advances in Neural Information Processing Systems (NeurIPS). Online; 2020.
- 52.
Wei C, Sohn K, Mellina C, Yuille A, Yang F. CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 10852–10861.
- 53.
Zou Y, Yu Z, Liu X, Kumar BVKV, Wang J. Confidence Regularized Self-Training. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019. p. 5981–5990.
- 54.
Mi P, Lin J, Zhou Y, Shen Y, Luo G, Sun X, et al. Active Teacher for Semi-Supervised Object Detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2022.
- 55.
Li G, Li X, Wang Y, Wu Y, Liang D, Zhang S. PseCo: Pseudo Labeling and Consistency Training for Semi-Supervised Object Detection. In: European Conference on Computer Vision (ECCV) 2022; 2022.
- 56.
Zhang L, Sun Y, Wei W. Mind the Gap: Polishing Pseudo Labels for Accurate Semi-supervised Object Detection. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2023. p. 3463–3471.
- 57.
He Y, Chen W, Liang KJ, Tan Y, Liang Z, Guo Y. Pseudo-label Correction and Learning For Semi-Supervised Object Detection. ArXiv. 2023;abs/2303.02998.
- 58. Ding X, Wang N, Gao X, Li J, Wang X, Liu T. KFC: An Efficient Framework for Semi-Supervised Temporal Action Localization. IEEE Transactions on Image Processing. 2021;30:6869–6878. pmid:34319876
- 59.
Ji J, Cao K, Niebles JC. Learning Temporal Action Proposals With Fewer Labels. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019. p. 7072–7081.
- 60.
Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017. p. 6299–6308.
- 61.
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In: European Conference on Computer Vision (ECCV) 2016; 2016.
- 62.
Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object Detection. In: 2017 IEEE International Conference on Computer Vision (ICCV); 2017. p. 2999–3007.
- 63.
Zhang H, Cissé M, Dauphin YN, Lopez-Paz D. mixup: Beyond Empirical Risk Minimization. In: International Conference on Learning Representations (ICLR); 2018.