Service usage analysis, which aims to identify customers' messaging behaviors from encrypted App traffic flows, has become a challenging and emerging task for service providers. Prior work typically first segments a traffic sequence into single-usage subsequences and then classifies these subsequences into different usage types. However, such methods can suffer from inaccurate traffic segmentation and from mixed-usage subsequences. To address this challenge, we exploit a multi-label multi-view learning strategy and develop an enhanced framework for in-App usage analytics. Specifically, we first devise an enhanced traffic segmentation method to reduce mixed-usage subsequences. In addition, we develop a multi-label multi-view logistic classification method that comprises two alignments. The first alignment exploits the classification consistency between the packet-length view and the time-delay view of traffic subsequences to improve classification accuracy. The second alignment combines the classification of single-usage subsequences and the post-classification of mixed-usage subsequences into a unified multi-label logistic classification problem. Finally, we present extensive experiments with real-world datasets to demonstrate the effectiveness of our approach. We find that the proposed multi-label multi-view framework helps overcome the problem of mixed-usage subsequences and can be generalized to latent activity analysis in sequential data, beyond in-App usage analytics.
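As a concrete illustration of the first alignment, the sketch below trains one multi-label logistic classifier per view (packet-length and time-delay features) with a penalty that pushes the two views' predictions to agree. This is a minimal stand-in, not the authors' implementation; the feature layouts, the agreement weight `lam`, and the optimization schedule are illustrative assumptions.

```python
# Minimal co-regularized two-view multi-label logistic sketch (assumed
# form, not the paper's exact objective).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_two_view(Xp, Xt, Y, lam=0.1, lr=0.05, epochs=200):
    """Xp: packet-length view; Xt: time-delay view; Y: {0,1} label matrix."""
    rng = np.random.default_rng(0)
    Wp = rng.normal(scale=0.01, size=(Xp.shape[1], Y.shape[1]))
    Wt = rng.normal(scale=0.01, size=(Xt.shape[1], Y.shape[1]))
    for _ in range(epochs):
        Pp, Pt = sigmoid(Xp @ Wp), sigmoid(Xt @ Wt)
        # Per-view logistic-loss gradient plus the gradient of the
        # view-agreement penalty lam * ||Pp - Pt||^2.
        Gp = Xp.T @ ((Pp - Y) + 2 * lam * (Pp - Pt) * Pp * (1 - Pp)) / len(Y)
        Gt = Xt.T @ ((Pt - Y) - 2 * lam * (Pp - Pt) * Pt * (1 - Pt)) / len(Y)
        Wp -= lr * Gp
        Wt -= lr * Gt
    return Wp, Wt   # predict by averaging the two views' sigmoid scores
```

At prediction time, a subsequence receives every label whose averaged score crosses 0.5, which is how a multi-label formulation can absorb mixed-usage subsequences.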
The analysts at a cybersecurity operations center (CSOC) analyze the alerts that are generated by intrusion detection systems (IDSs). Under normal operating conditions, sufficient numbers of analysts are available to analyze the alert workload. For the purpose of this paper, this means that the cybersecurity analysts in each shift can fully investigate each and every alert that is generated by the IDSs in a reasonable amount of time and perform their normal tasks in a shift. Normal tasks include analysis time, time to attend training programs, report-writing time, personal break time, and time to update the signatures on new patterns in alerts as detected by the IDSs. There are a number of disruptive factors that occur randomly and can adversely impact the normal operating condition of a CSOC, such as (1) higher alert generation rates from a few IDSs, (2) new alert patterns that decrease the throughput of the alert analysis process, and (3) analyst absenteeism. The impact of all the above factors is that alerts wait for a long duration before being analyzed, which degrades the Level of Operational Effectiveness (LOE) of the CSOC. In order to return the CSOC to normal operating conditions, the manager of a CSOC can take several actions, such as increasing the alert analysis time spent by analysts in a shift by cancelling a training program, spending some of their own time to assist the analysts in alert investigation, and calling upon the on-call analyst workforce to boost the service rate of alerts. However, additional resources are limited in quantity over a 14-day work cycle, and the CSOC manager must determine when and how much action to take in the face of uncertainty, which arises from both the intensity and the random occurrences of the disruptive factors. The above decision by the CSOC manager is non-trivial and is often made in an ad hoc manner using prior experience. This paper develops a reinforcement learning (RL) model for optimizing the LOE throughout the entire 14-day work cycle of a CSOC in the face of uncertainties due to disruptive events. Results indicate that the RL model provides the CSOC manager with a decision support tool that makes better decisions than current practices in determining when and how much resource to allocate when the LOE of a CSOC deviates from the normal operating condition.
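To make the sequential decision concrete, the toy sketch below applies tabular Q-learning to a drastically simplified version of the problem: the state is (day, remaining on-call budget, discretized backlog level), the action is how many resource units to commit that day, and the reward penalizes backlog. The transition simulator, discretization, and reward are illustrative assumptions, not the paper's model.

```python
# Toy Q-learning sketch of the 14-day resource-allocation decision.
import numpy as np

DAYS, BUDGET, LEVELS, ACTIONS = 14, 5, 4, 3     # assumed discretization
Q = np.zeros((DAYS, BUDGET + 1, LEVELS, ACTIONS))
rng = np.random.default_rng(1)

def step(budget, backlog, action):
    """One simulated day: a random disruption raises the backlog, committed
    resources lower it, and the reward penalizes waiting alerts (LOE loss)."""
    spend = min(action, budget)
    disruption = int(rng.integers(0, 2))
    backlog = int(np.clip(backlog + disruption - spend, 0, LEVELS - 1))
    return budget - spend, backlog, -backlog

for episode in range(5000):                     # epsilon-greedy Q-learning
    budget, backlog = BUDGET, 0
    for day in range(DAYS):
        a = int(rng.integers(ACTIONS)) if rng.random() < 0.1 \
            else int(np.argmax(Q[day, budget, backlog]))
        nb, nbl, r = step(budget, backlog, a)
        target = r if day == DAYS - 1 else r + 0.95 * Q[day + 1, nb, nbl].max()
        Q[day, budget, backlog, a] += 0.1 * (target - Q[day, budget, backlog, a])
        budget, backlog = nb, nbl
```

The greedy policy read off the learned Q-table then answers the manager's "when and how much" question for each state encountered during the cycle.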
Co-saliency detection is a newly emerging and rapidly growing research area in the computer vision community. As a novel branch of visual saliency, co-saliency detection refers to the discovery of common and salient foregrounds from two or more relevant images, and it can be widely used in many computer vision tasks. The existing co-saliency detection algorithms mainly consist of three components: extracting effective features to represent the image regions, exploring the informative cues or factors to characterize co-saliency, and designing effective computational frameworks to formulate co-saliency. Although numerous methods have been developed, the literature still lacks a deep review and evaluation of co-saliency detection techniques. In this paper, we aim to provide a comprehensive review of the fundamentals, challenges, and applications of co-saliency detection. Specifically, we provide an overview of some related computer vision works, review the history of co-saliency detection, summarize and categorize the major algorithms in this research area, discuss some open issues, present the potential applications of co-saliency detection, and finally point out some unsolved challenges and promising future works. We expect this review to be beneficial to both fresh and senior researchers in this field, and to give insights to researchers in other related areas regarding the utility of co-saliency detection algorithms.
It is important to be able to determine the varying effects of an intervention on patients' health. For new medical treatments, it is often the case that some patients do not respond or, worse yet, have adverse reactions. In this work, we are interested in identifying distinctive subpopulations that respond to a given intervention in particular ways, a phenomenon called the heterogeneity of the treatment effect (HTE) across subpopulations. For this purpose, we have developed a Bayesian mixture model. The novelty of our approach is that it combines the following features: complex decision boundaries, soft clustering, multivariate outcomes, and prior knowledge. The last feature can be very useful for datasets with small sample sizes. We demonstrate how our method works by applying it to both simulated and real data. Results of our evaluation show that our model has strong predictive power and is capable of producing high-quality clusters.
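For intuition, the sketch below fits a mixture of linear outcome models with EM: each soft cluster carries its own treatment coefficient, so the clusters expose the HTE. It is an illustrative frequentist stand-in for the paper's Bayesian model, with a univariate outcome, no priors, and linear decision boundaries.

```python
# EM for a mixture of linear regressions over (covariates, treatment).
import numpy as np

def em_hte(X, t, y, K=2, iters=50):
    """X: (n, d) covariates; t: (n,) treatment indicator; y: (n,) outcome."""
    n, d = X.shape
    Z = np.column_stack([X, t, np.ones(n)])        # treatment coef. is column d
    rng = np.random.default_rng(0)
    resp = rng.dirichlet(np.ones(K), size=n)       # soft cluster memberships
    for _ in range(iters):
        pi = resp.mean(axis=0)                     # mixture weights
        W, sig = [], []
        for k in range(K):                         # M-step: weighted least squares
            sw = np.sqrt(resp[:, k])
            w = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)[0]
            res = y - Z @ w
            s2 = (resp[:, k] * res ** 2).sum() / resp[:, k].sum()
            W.append(w)
            sig.append(np.sqrt(s2) + 1e-8)
        # E-step: responsibilities from per-cluster Gaussian likelihoods.
        ll = np.stack([np.log(pi[k] + 1e-12)
                       - 0.5 * ((y - Z @ W[k]) / sig[k]) ** 2
                       - np.log(sig[k]) for k in range(K)], axis=1)
        ll -= ll.max(axis=1, keepdims=True)
        resp = np.exp(ll)
        resp /= resp.sum(axis=1, keepdims=True)
    return W, resp    # W[k][d] is cluster k's estimated treatment effect
```

Reading off `W[k][d]` for each cluster gives the per-subpopulation treatment effect, and `resp` gives every patient's soft subgroup assignment.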
The scarcity of potable water is a critical challenge in many regions around the world. Previous studies have shown that knowledge of device-level water usage can lead to significant conservation. Although there is considerable interest in determining discriminative features via sparse coding for water disaggregation, which separates whole-house consumption into its component appliances, existing methods lack a mechanism for fitting coefficient distributions and are thus unable to accurately discriminate the consumption of devices running in parallel. This paper proposes a Bayesian discriminative sparse coding model, referred to as Virtual Metering (VM), for this disaggregation task. A Mixture-of-Gammas prior is placed on the coefficients, which contributes two benefits: (1) it guarantees the coefficients' sparseness and non-negativeness, and (2) it captures the distribution of active coefficients. The resulting method effectively adapts the bases to the aggregated consumption to facilitate discriminative learning, and devices' shape features are formalized and incorporated into the Bayesian sparse coding to direct the learning of basis functions. Compact Gibbs Sampling (CGS) is developed to accelerate the inference process by utilizing the sparse structure of the coefficients. Empirical results obtained by applying the new model to large-scale real and synthetic datasets reveal that VM significantly outperforms the benchmark methods.
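The core computational step is inferring non-negative, sparse activations of per-device bases from the aggregate signal. The sketch below replaces the paper's Mixture-of-Gammas prior and Gibbs sampler with a plain multiplicative-update solver for an L1-penalized non-negative least-squares objective, purely to show where the activations come from; `lam` and the basis layout are assumptions.

```python
# Non-negative sparse coding of an aggregate signal over device bases
# (illustrative stand-in for the paper's Bayesian inference).
import numpy as np

def disaggregate(x, B, lam=0.1, iters=200):
    """x: (T,) aggregate consumption; B: (T, K) concatenated device bases."""
    rng = np.random.default_rng(0)
    a = rng.random(B.shape[1])                    # non-negative activations
    for _ in range(iters):
        # Multiplicative update for 0.5*||x - B a||^2 + lam*||a||_1, a >= 0.
        num = np.clip(B.T @ x, 0, None)
        den = B.T @ (B @ a) + lam + 1e-12
        a *= num / den
    return a    # device j's share is B[:, cols_j] @ a[cols_j] for its columns
```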
Social media provides a platform for seeking information from a large user base. Information seeking in social media, however, occurs simultaneously with users expressing their viewpoints by making statements. Rhetorical questions, an important tool employed by users to express their viewpoints, have the form of a question but serve the function of a statement. Rhetorical questions might, therefore, mislead platforms assisting information seeking in social media. Identifying rhetorical questions is difficult because they are not syntactically different from other questions. In this paper, we develop a framework to identify rhetorical questions by modeling the possible motivations of the users who post them. Drawing on linguistic theories, we focus on two possible motivations: to implicitly convey a message and to modify the strength of a previously made statement. From these motivations, we develop a quantitative framework to identify rhetorical questions in social media. We evaluate the framework using two datasets of questions posted on the social media platform Twitter and demonstrate its effectiveness in identifying rhetorical questions. To the best of our knowledge, this is the first framework to model the possible motivations for posting rhetorical questions in order to identify them on social media platforms.
Multi-label learning has become an important area of research owing to the increasing number of real-world problems that involve multi-label data. Data labelling is an expensive process that requires expert handling. The annotation of multi-label data is especially laborious, since a human expert needs to consider the presence or absence of each possible label. Consequently, many modern multi-label problems involve a small number of labelled examples and plentiful unlabelled examples simultaneously. Active learning methods make it possible to induce better classifiers by selecting the most useful unlabelled data, thus considerably reducing the labelling effort and the cost of training an accurate model. Batch-mode active learning methods select a set of unlabelled examples in each iteration, such that the selected examples are informative and as diverse as possible. This paper presents a strategy to perform batch-mode active learning on multi-label data. Batch-mode active learning is formulated as a multi-objective problem and solved by means of an evolutionary algorithm. Extensive experiments were conducted on a large collection of datasets, and the experimental results confirm the effectiveness of our proposal for batch-mode multi-label active learning.
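The sketch below shows the flavor of an evolutionary search over candidate batches. For brevity it scalarizes the two objectives (informativeness from per-example uncertainty `uncert`, diversity from a pairwise distance matrix `dist`) into a single fitness and uses a simple (mu + 1) replacement scheme, whereas the paper keeps them as a true multi-objective problem; all names and parameters here are illustrative.

```python
# Evolutionary batch selection sketch (scalarized objectives).
import numpy as np

def evolve_batch(uncert, dist, batch=10, pop=40, gens=200, alpha=0.5):
    """uncert: (n,) uncertainty scores; dist: (n, n) pairwise distances."""
    n = len(uncert)
    rng = np.random.default_rng(0)

    def fitness(idx):
        info = uncert[idx].mean()                 # informativeness objective
        div = dist[np.ix_(idx, idx)].mean()       # diversity objective
        return alpha * info + (1 - alpha) * div   # scalarized for brevity

    P = [rng.choice(n, size=batch, replace=False) for _ in range(pop)]
    for _ in range(gens):
        child = P[rng.integers(pop)].copy()
        # Mutation: swap one selected example for a random unselected one.
        outside = np.setdiff1d(np.arange(n), child)
        child[rng.integers(batch)] = rng.choice(outside)
        scores = [fitness(idx) for idx in P]
        worst = int(np.argmin(scores))
        if fitness(child) > scores[worst]:        # (mu + 1) replacement
            P[worst] = child
    return P[int(np.argmax([fitness(idx) for idx in P]))]
```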
Query expansion under the pseudo-relevance feedback (PRF) framework has been extensively studied in information retrieval. However, most expansion methods are based mainly on the statistics of single terms, which can generate many irrelevant query terms and decrease retrieval performance. To alleviate this problem, we propose an approach that feeds PRF-based contextual snippets into a context-aware topic model to enhance query representations. Specifically, instead of selecting a series of independent terms, we make full use of the query's contextual information and focus on snippets of length n in the PRF documents. Furthermore, we propose a context-aware topic (CAT) model to mine the topic distributions of the query-relevant snippets, termed fine contextual snippets. Different from traditional topic models that infer topics from the whole corpus, we establish a bridge between the snippets and the corresponding PRF documents, which can be used to model the topics more precisely and efficiently. Finally, the topic distributions of the fine snippets are used for query representations, which are both context-aware and topic-sensitive. To evaluate the performance of our approach, we integrate the obtained queries into a topic-based hybrid retrieval model and conduct extensive experiments on various TREC collections. The experimental results show that our query modeling approach is more effective in boosting retrieval performance than the state-of-the-art methods.
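A hedged sketch of this pipeline: extract length-n snippets centered on query-term occurrences in the PRF documents, fit a topic model over them, and rank expansion terms by the aggregated snippet topic distribution. Standard LDA is used below as a stand-in for the CAT model, and the tokenization, `n`, and scoring scheme are illustrative assumptions.

```python
# Snippet-based query expansion sketch with LDA standing in for CAT.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def expand_query(query_terms, prf_docs, n=5, topics=10, top_k=10):
    snippets = []
    for doc in prf_docs:
        words = doc.split()
        for i, w in enumerate(words):
            if w in query_terms:                  # snippet centered on a query hit
                snippets.append(" ".join(words[max(0, i - n):i + n + 1]))
    vec = CountVectorizer()
    X = vec.fit_transform(snippets)
    lda = LatentDirichletAllocation(n_components=topics, random_state=0).fit(X)
    theta = lda.transform(X).mean(axis=0)         # aggregated snippet topics
    scores = theta @ lda.components_              # expected term weights
    vocab = np.array(vec.get_feature_names_out())
    return vocab[np.argsort(scores)[::-1][:top_k]]
```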
With the rapid growth of social media, misinformation is spreading widely on platforms such as Weibo and Twitter and has negative effects on human life. Automatic misinformation identification has therefore drawn attention from the academic and industrial communities. Because an event on social media usually consists of multiple microblogs, current methods are mainly constructed from global statistical features. However, information on social media is noisy, and this noise should be alleviated. Moreover, most microblogs about an event contribute little to the identification of misinformation, so useful information can easily be overwhelmed by useless information. Thus, it is important to mine significant microblogs in order to construct a reliable misinformation identification method. In this paper, we propose an Attention-based approach for Identification of Misinformation (AIM). Based on the attention mechanism, AIM can select the microblogs with the largest attention values for misinformation identification. The attention mechanism in AIM contains two parts: content attention and dynamic attention. Content attention is calculated from the textual features of each microblog. Dynamic attention is related to the time interval between the posting time of a microblog and the beginning of the event. To evaluate AIM, we conduct a series of experiments on the Weibo and Twitter datasets, and the experimental results show that the proposed AIM model outperforms the state-of-the-art methods.
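The sketch below illustrates the two attention components: a content score computed from each microblog's feature vector and a dynamic score that decays with the posting delay from the event's start. The feature dimensions, scoring vector `v`, and exponential decay form are assumptions for illustration, not the paper's exact parameterization.

```python
# Two-part (content + dynamic) attention over an event's microblogs.
import numpy as np

def aim_attention(H, delays, v, tau=1.0):
    """H: (m, d) microblog feature vectors; delays: (m,) time since event start."""
    content = H @ v                               # learned content scoring vector
    dynamic = -delays / tau                       # earlier posts score higher
    logits = content + dynamic
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # attention over microblogs
    return w @ H                                  # attended event representation
```

The attended event representation is then fed to a downstream classifier that labels the event as misinformation or not.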
Question retrieval, which aims to find questions similar to a given question, plays a pivotal role in various question answering (QA) systems. This task is quite challenging, mainly in five aspects: lexical gap, polysemy, word order, question length, and data sparsity. In this paper, we propose a unified framework to handle these five problems simultaneously. We use each word combined with its corresponding concept information to handle the lexical gap and polysemy problems. The concept embedding and word embedding are learned at the same time from both a context-dependent and a context-independent view. To handle the word order problem, we propose a high-level feature-embedded convolutional semantic model that learns the question embedding from the concept and word embeddings. Because some questions are long, we propose a value-based convolutional attention method to enhance the model's ability to learn the key parts of the question and the answer. The proposed high-level feature-embedded convolutional semantic model nicely represents the hierarchical structures of word information and concept information in sentences through its layer-by-layer convolution and pooling. Finally, to address data sparsity, we propose to use multi-view learning to train the attention-based convolutional semantic model on question-answer pairs. To the best of our knowledge, we are the first to propose handling the above five problems simultaneously in question retrieval using one framework. Experiments on two real question answering datasets show that the proposed framework significantly outperforms the state-of-the-art solutions.
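As a sketch of the value-based convolutional attention, the PyTorch module below convolves a question's embedding sequence (assumed to be the concatenation of word and concept embeddings) and pools it with learned per-position attention values. Layer sizes and names are illustrative, not the paper's architecture.

```python
# Convolutional question encoder with value-based attention pooling.
import torch
import torch.nn as nn

class ConvSemanticEncoder(nn.Module):
    def __init__(self, emb_dim=100, channels=128, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        self.score = nn.Linear(channels, 1)       # per-position attention value

    def forward(self, x):                         # x: (batch, seq, emb_dim)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        w = torch.softmax(self.score(h).squeeze(-1), dim=1)
        return (w.unsqueeze(-1) * h).sum(dim=1)   # attended question embedding
```

In a retrieval setting, question and candidate embeddings produced this way would be compared with cosine similarity to rank similar questions.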
Understanding human visual attention is essential for understanding human cognition, which in turn benefits human-computer interaction. Recent work has demonstrated a Personalized, Auto-Calibrating Eye-tracking system (PACE), which achieves accurate gaze estimation using only an off-the-shelf webcam by identifying and collecting data implicitly from user interaction events. However, this method is constrained by the need for large amounts of well-annotated data. We thus present fast-PACE, an extension of PACE that transfers knowledge from existing source data to accelerate the learning of the personalized model. The result is an adaptive, data-driven approach that continuously recalibrates, adapts, and improves with additional use. Experimental evaluations of fast-PACE demonstrate its competitive accuracy in iris localization, the validity of its alignment identification between gaze and interactions, and the effectiveness of its gaze transfer. Overall, fast-PACE achieves an initial visual error of 3.98°, which steadily improves to 2.52° given incremental interaction-informed data. Its performance is comparable to the state of the art, but without the need for explicit training or calibration. Our technique addresses both the data quality and data quantity problems and therefore has the potential to enable comprehensive gaze-aware applications in the wild.
In the conventional supervised learning paradigm, each data instance is associated with a single class label. Multi-label learning differs in that data instances may belong to multiple concepts simultaneously; such data naturally appear in a variety of high-impact domains, ranging from bioinformatics and information retrieval to multimedia analysis. Multi-label learning aims to leverage the multiple-label information of data instances to build a predictive model that can classify unlabeled instances into one or more predefined target classes. Even though each instance is associated with a rich set of class labels, the label information can be noisy and incomplete, as the labeling process is both time-consuming and labor-intensive, leading to potentially missing or even erroneous annotations. The existence of noisy and missing labels can negatively affect the performance of the underlying learning algorithms. Moreover, multi-labeled data often have noisy, irrelevant, and redundant features of high dimensionality, and these uninformative features may also deteriorate the predictive power of the learning model due to the curse of dimensionality. Feature selection, an effective dimensionality reduction technique, has been shown to be powerful in preparing high-dimensional data for numerous data mining and machine learning tasks. However, the vast majority of existing multi-label feature selection algorithms either boil down to solving multiple single-labeled feature selection problems or directly use the imperfect labels to guide the selection of representative features. As a result, they may not be able to obtain discriminative features shared across multiple labels. In this paper, to bridge the gap between the rich source of multi-label information and its blemishes in practical usage, we propose a novel noise-resilient multi-label informed feature selection framework (MIFS) that exploits the correlations among different labels. In particular, to reduce the negative effects of imperfect label information when obtaining label correlations, we decompose the multi-label information of data instances into a low-dimensional space and then employ the reduced label representation to guide the feature selection phase via a joint sparse regression framework. Empirical studies on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of the proposed MIFS framework.
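The sketch below gives a simplified alternating-minimization reading of this idea: factor the noisy label matrix Y into a low-dimensional representation V (with loadings B) and regress X onto V under an l2,1 row-sparsity penalty, so that features whose rows of W have large norms are selected. The reweighted least-squares update for the l2,1 term is a standard approximation, the objective omits the paper's other regularizers, and the hyperparameters are illustrative.

```python
# Simplified MIFS-style alternating minimization (assumed objective:
# ||X W - V||^2 + alpha ||Y - V B||^2 + beta ||W||_{2,1}).
import numpy as np

def mifs_sketch(X, Y, c=4, alpha=1.0, beta=0.1, iters=30):
    """X: (n, d) features; Y: (n, L) noisy labels; c: latent label dimension."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    V = rng.random((n, c))                        # reduced label representation
    W = np.zeros((d, c))                          # sparse regression weights
    for _ in range(iters):
        B = np.linalg.lstsq(V, Y, rcond=None)[0]  # loadings: Y ~ V B
        # Closed-form V update for the two quadratic terms.
        M = np.eye(c) + alpha * (B @ B.T)
        V = (X @ W + alpha * Y @ B.T) @ np.linalg.inv(M)
        # W update with an iteratively reweighted l2,1 penalty.
        D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + 1e-8))
        W = np.linalg.solve(X.T @ X + beta * D, X.T @ V)
    return np.argsort(np.linalg.norm(W, axis=1))[::-1]   # feature ranking
```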
One critical deficiency of traditional online kernel learning methods is their unbounded and growing number of support vectors during the online learning process, making them inefficient and non-scalable for large-scale applications. Recent studies on scalable online kernel learning have attempted to overcome this shortcoming, e.g., by imposing a constant budget on the number of support vectors. Although they attempt to bound the number of support vectors at each online learning iteration, most of them fail to bound the number of support vectors for the final output hypothesis, which is often obtained by averaging the series of hypotheses over all the iterations. In this paper, we propose a novel framework for bounded online kernel methods, named Sparse Passive Aggressive (SPA) learning, which is able to yield a final output kernel-based hypothesis with a bounded number of support vectors. Unlike the budget maintenance strategy used by many existing budget online kernel learning approaches, our approach attains a bounded number of support vectors using an efficient stochastic sampling strategy that admits an incoming training example as a new support vector with a probability proportional to the loss it suffers. We theoretically prove that SPA achieves an optimal mistake bound in expectation, and we empirically show that it outperforms various budget online kernel learning algorithms. Finally, in addition to general online kernel learning tasks, we also apply SPA to derive bounded online multiple kernel learning algorithms, which can significantly improve the scalability of traditional Online Multiple Kernel Classification (OMKC) algorithms while achieving satisfactory learning accuracy compared with the existing unbounded OMKC algorithms.
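The sampling idea is easy to state in code: on a margin violation, admit the incoming example as a support vector only with a probability that grows with the hinge loss it suffers, rather than always. The sketch below shows this for binary classification with an RBF kernel; the exact admission probability, the step size, and the constant `C` are illustrative assumptions, not the authors' exact rules.

```python
# Loss-proportional stochastic support-vector admission (SPA-style sketch).
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

class SPASketch:
    def __init__(self, C=1.0, gamma=1.0, seed=0):
        self.sv, self.coef = [], []               # support vectors and weights
        self.C, self.gamma = C, gamma
        self.rng = np.random.default_rng(seed)

    def score(self, x):
        return sum(a * rbf(x, z, self.gamma) for a, z in zip(self.coef, self.sv))

    def update(self, x, y):                       # y in {-1, +1}
        loss = max(0.0, 1.0 - y * self.score(x))  # hinge loss
        if loss == 0.0:
            return                                # no margin violation
        # Admit x with probability increasing in its loss (illustrative form),
        # instead of deterministically adding every violating example.
        if self.rng.random() < loss / (loss + 1.0):
            self.coef.append(y * min(self.C, loss))   # PA-style clipped step
            self.sv.append(np.asarray(x, dtype=float))
```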
Recent decades have witnessed the rapid growth of educational data mining (EDM), which aims to automatically extract valuable information from large repositories of data generated by or related to people's learning activities in educational settings. One of the key EDM tasks is cognitive modelling based on examination data, which profiles examinees by discovering their latent knowledge states and cognitive levels (e.g., their proficiency in specific skills) in a psychometric way. However, the problem of extracting information from both objective and subjective exam problems to obtain more precise and interpretable cognitive analysis is still underexplored. To this end, we propose a fuzzy cognitive diagnosis framework (FuzzyCDF) for examinees' cognitive modelling with both objective and subjective problems. Specifically, to handle the partially correct responses on subjective problems, we first fuzzify the skill proficiency of examinees. Then, we combine fuzzy set theory and educational hypotheses to model the examinees' mastery of the problems. Finally, we simulate the generation of examination scores by considering both slip and guess factors to complete the framework. For further comprehensive verification, we design effective solutions based on FuzzyCDF for three classical cognitive assessment tasks, i.e., predicting examinee performance, slip and guess detection, and cognitive diagnosis visualization. Extensive experiments on three real-world datasets for these three tasks show that FuzzyCDF can reveal the knowledge states and cognitive levels of examinees effectively and interpretably.
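The slip-and-guess scoring rule at the heart of such a framework can be written compactly. In the sketch below, a logistic membership function gives a fuzzy skill proficiency, a fuzzy intersection (minimum) combines the skills a problem requires, and slip/guess parameters turn mastery into a correctness probability for objective problems or an expected partial score for subjective ones. The logistic form and the min-intersection are common conventions assumed here, not necessarily the paper's exact choices.

```python
# Fuzzy mastery with slip-and-guess scoring (illustrative conventions).
import numpy as np

def mastery(theta, a, b):
    """Fuzzy proficiency on one skill via a logistic membership function."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def problem_mastery(theta, skills_a, skills_b):
    """Conjunctive problem: fuzzy intersection (min) over required skills."""
    return min(mastery(theta, a, b) for a, b in zip(skills_a, skills_b))

def expected_score(eta, slip, guess, full_mark=1.0, objective=True):
    p = (1 - slip) * eta + guess * (1 - eta)      # slip-and-guess mixing
    return p if objective else full_mark * p      # partial credit if subjective
```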
Transfer function (TF) generation is a fundamental problem in Direct Volume Rendering (DVR). A TF maps voxels to color and opacity values to reveal inner structures. Existing TF tools are complex and unintuitive for users, who are more likely to be medical professionals than computer scientists. In this paper, we propose a novel image-centric method for TF generation in which, instead of using complex tools, the user directly manipulates the volume data to generate the DVR. The user's work is further simplified by presenting only the most informative volume slices for selection. Based on the selected parts, the voxels are classified using our novel Sparse Nonparametric Support Vector Machine classifier, which combines both local and near-global distributional information of the training data. The voxel classes are mapped to aesthetically pleasing and distinguishable color and opacity values using harmonic colors. Experimental results on several benchmark datasets and a detailed user survey show the effectiveness of the proposed method.
Energy disaggregation, the task of taking a whole-home electricity signal and decomposing it into its component appliances, has proved essential in energy conservation research. One powerful cue for breaking down an entire household's energy consumption is users' daily energy usage behavior, which has so far received little attention: existing works on energy disaggregation mostly ignore the relationship between the energy usage of various appliances by householders across different time slots. The major challenge in modeling such relationships is that, with ambiguous appliance usage membership of householders, it is difficult to appropriately model the influence between appliances, since such influence is determined by human behaviors in energy usage. To address this problem, we propose to model the influence between householders' energy usage behaviors directly through a novel probabilistic model that combines topic models with Hawkes processes. The proposed model simultaneously disaggregates the whole-home electricity signal into each component appliance and infers the appliance usage membership of household members, enabling those two tasks to mutually benefit each other. Experimental results on both synthetic data and four real-world datasets demonstrate the effectiveness of our model, which outperforms state-of-the-art approaches not only in decomposing the total consumed energy into each appliance's consumption but also in inferring household structures. We further analyze the inferred appliance-householder assignments and the corresponding influence within each householder's appliance usage and across different householders, which provides insight into appealing human behavior patterns in appliance usage.
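The Hawkes ingredient can be illustrated with a small intensity function: appliance j's instantaneous usage rate is its base rate plus decaying excitation contributed by recent events of other appliances, which is how cross-appliance behavioral influence enters the model. The exponential kernel and parameter names below are assumptions for illustration, not the paper's exact specification.

```python
# Multivariate Hawkes intensity for appliance usage events (sketch).
import numpy as np

def intensity(t, events, mu, A, beta=1.0):
    """events: list of (time, appliance) pairs; mu: (J,) base rates;
    A: (J, J) influence matrix, A[i, j] = excitation of j by i's events."""
    lam = mu.copy()
    for s, i in events:
        if s < t:
            lam += A[i] * beta * np.exp(-beta * (t - s))   # decaying excitation
    return lam          # lam[j] = instantaneous usage rate of appliance j
```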