Robust Action Recognition Using Multi-Scale Spatial-Temporal Concatenations of Local Features as Natural Action Structures

Xiaoyuan Zhu; Meng Li; Xiaojian Li; Zhiyong Yang; Joseph Zhuo Tsien

doi:10.1371/journal.pone.0046686

Robust Action Recognition Using Multi-Scale Spatial-Temporal Concatenations of Local Features as Natural Action Structures

Xiaoyuan Zhu, Meng Li, Xiaojian Li, Zhiyong Yang, Joseph Zhuo Tsien

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Human and many other animals can detect, recognize, and classify natural actions in a very short time. How this is achieved by the visual system and how to make machines understand natural actions have been the focus of neurobiological studies and computational modeling in the last several decades. A key issue is what spatial-temporal features should be encoded and what the characteristics of their occurrences are in natural actions. Current global encoding schemes depend heavily on segmenting while local encoding schemes lack descriptive power. Here, we propose natural action structures, i.e., multi-size, multi-scale, spatial-temporal concatenations of local features, as the basic features for representing natural actions. In this concept, any action is a spatial-temporal concatenation of a set of natural action structures, which convey a full range of information about natural actions. We took several steps to extract these structures. First, we sampled a large number of sequences of patches at multiple spatial-temporal scales. Second, we performed independent component analysis on the patch sequences and classified the independent components into clusters. Finally, we compiled a large set of natural action structures, with each corresponding to a unique combination of the clusters at the selected spatial-temporal scales. To classify human actions, we used a set of informative natural action structures as inputs to two widely used models. We found that the natural action structures obtained here achieved a significantly better recognition performance than low-level features and that the performance was better than or comparable to the best current models. We also found that the classification performance with natural action structures as features was slightly affected by changes of scale and artificially added noise. We concluded that the natural action structures proposed here can be used as the basic encoding units of actions and may hold the key to natural action understanding.

Original language	English (US)
Article number	e46686
Journal	PloS one
Volume	7
Issue number	10
DOIs	https://doi.org/10.1371/journal.pone.0046686
State	Published - Oct 4 2012

ASJC Scopus subject areas

General

Access to Document

10.1371/journal.pone.0046686

Cite this

@article{600be7df7a9e44daa8f3ca26d25a8a86,

title = "Robust Action Recognition Using Multi-Scale Spatial-Temporal Concatenations of Local Features as Natural Action Structures",

abstract = "Human and many other animals can detect, recognize, and classify natural actions in a very short time. How this is achieved by the visual system and how to make machines understand natural actions have been the focus of neurobiological studies and computational modeling in the last several decades. A key issue is what spatial-temporal features should be encoded and what the characteristics of their occurrences are in natural actions. Current global encoding schemes depend heavily on segmenting while local encoding schemes lack descriptive power. Here, we propose natural action structures, i.e., multi-size, multi-scale, spatial-temporal concatenations of local features, as the basic features for representing natural actions. In this concept, any action is a spatial-temporal concatenation of a set of natural action structures, which convey a full range of information about natural actions. We took several steps to extract these structures. First, we sampled a large number of sequences of patches at multiple spatial-temporal scales. Second, we performed independent component analysis on the patch sequences and classified the independent components into clusters. Finally, we compiled a large set of natural action structures, with each corresponding to a unique combination of the clusters at the selected spatial-temporal scales. To classify human actions, we used a set of informative natural action structures as inputs to two widely used models. We found that the natural action structures obtained here achieved a significantly better recognition performance than low-level features and that the performance was better than or comparable to the best current models. We also found that the classification performance with natural action structures as features was slightly affected by changes of scale and artificially added noise. We concluded that the natural action structures proposed here can be used as the basic encoding units of actions and may hold the key to natural action understanding.",

author = "Xiaoyuan Zhu and Meng Li and Xiaojian Li and Zhiyong Yang and Tsien, {Joseph Zhuo}",

year = "2012",

month = oct,

day = "4",

doi = "10.1371/journal.pone.0046686",

language = "English (US)",

volume = "7",

journal = "PloS one",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "10",

}

TY - JOUR

T1 - Robust Action Recognition Using Multi-Scale Spatial-Temporal Concatenations of Local Features as Natural Action Structures

AU - Zhu, Xiaoyuan

AU - Li, Meng

AU - Li, Xiaojian

AU - Yang, Zhiyong

AU - Tsien, Joseph Zhuo

PY - 2012/10/4

Y1 - 2012/10/4

N2 - Human and many other animals can detect, recognize, and classify natural actions in a very short time. How this is achieved by the visual system and how to make machines understand natural actions have been the focus of neurobiological studies and computational modeling in the last several decades. A key issue is what spatial-temporal features should be encoded and what the characteristics of their occurrences are in natural actions. Current global encoding schemes depend heavily on segmenting while local encoding schemes lack descriptive power. Here, we propose natural action structures, i.e., multi-size, multi-scale, spatial-temporal concatenations of local features, as the basic features for representing natural actions. In this concept, any action is a spatial-temporal concatenation of a set of natural action structures, which convey a full range of information about natural actions. We took several steps to extract these structures. First, we sampled a large number of sequences of patches at multiple spatial-temporal scales. Second, we performed independent component analysis on the patch sequences and classified the independent components into clusters. Finally, we compiled a large set of natural action structures, with each corresponding to a unique combination of the clusters at the selected spatial-temporal scales. To classify human actions, we used a set of informative natural action structures as inputs to two widely used models. We found that the natural action structures obtained here achieved a significantly better recognition performance than low-level features and that the performance was better than or comparable to the best current models. We also found that the classification performance with natural action structures as features was slightly affected by changes of scale and artificially added noise. We concluded that the natural action structures proposed here can be used as the basic encoding units of actions and may hold the key to natural action understanding.

AB - Human and many other animals can detect, recognize, and classify natural actions in a very short time. How this is achieved by the visual system and how to make machines understand natural actions have been the focus of neurobiological studies and computational modeling in the last several decades. A key issue is what spatial-temporal features should be encoded and what the characteristics of their occurrences are in natural actions. Current global encoding schemes depend heavily on segmenting while local encoding schemes lack descriptive power. Here, we propose natural action structures, i.e., multi-size, multi-scale, spatial-temporal concatenations of local features, as the basic features for representing natural actions. In this concept, any action is a spatial-temporal concatenation of a set of natural action structures, which convey a full range of information about natural actions. We took several steps to extract these structures. First, we sampled a large number of sequences of patches at multiple spatial-temporal scales. Second, we performed independent component analysis on the patch sequences and classified the independent components into clusters. Finally, we compiled a large set of natural action structures, with each corresponding to a unique combination of the clusters at the selected spatial-temporal scales. To classify human actions, we used a set of informative natural action structures as inputs to two widely used models. We found that the natural action structures obtained here achieved a significantly better recognition performance than low-level features and that the performance was better than or comparable to the best current models. We also found that the classification performance with natural action structures as features was slightly affected by changes of scale and artificially added noise. We concluded that the natural action structures proposed here can be used as the basic encoding units of actions and may hold the key to natural action understanding.

UR - http://www.scopus.com/inward/record.url?scp=84867136880&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867136880&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0046686

DO - 10.1371/journal.pone.0046686

M3 - Article

C2 - 23056403

AN - SCOPUS:84867136880

SN - 1932-6203

VL - 7

JO - PloS one

JF - PloS one

IS - 10

M1 - e46686

ER -

Robust Action Recognition Using Multi-Scale Spatial-Temporal Concatenations of Local Features as Natural Action Structures

Abstract

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this