Human Action Recognition/Tracking

September 16, 2013 ⁄ General

Human action recognition is the process of detecting human actions in a video and labeling them across its image sequence (a video is simply a sequence of images). Solutions to this problem have applications in domains such as visual surveillance, video retrieval, and human-computer interaction. However, implementation is technically challenging because of variations in how a motion is performed, differences in recording settings, inter-personal differences, and differences in background. Why are these problems? Let me explain.

In the field of human action recognition, researchers often adopt a pipeline of initialization on every frame of the video, body structure analysis in every frame, tracking of the human or object, and recognition, but many factors influence the results and performance of these steps. First, we need to locate the human in an image and label that location for the next step, but variations in motion performance and dynamic video make initialization difficult. Second, we compute a descriptor of the human or object, usually a vector produced by some robust algorithm, and track the human or object through the video sequence using this descriptor; however, many recording settings do not produce images of high enough resolution to yield an accurate descriptor, which degrades tracking performance. Finally, inter-personal differences also affect recognition, because every person's descriptor is different and descriptors for all people can never be collected. Researchers have proposed many good ideas to address these problems in recent years.

The optical flow descriptor, which captures the motion information of every pixel in the image, is used to cope with variation in motion performance and dynamic video so that the human can be located accurately in each video frame. After running the optical flow algorithm, we apply background subtraction to obtain the ROI (Region Of Interest), label the location of the human in the image, and extract a descriptor of the object region for the subsequent tracking step.
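
To make "motion information of every pixel" concrete, here is a minimal sketch of one common way to estimate optical flow: a Lucas-Kanade style least-squares fit over a small window. The text above does not name a specific method, so this choice is my assumption, and the function name `lucas_kanade_window` and the synthetic frames from `make_frame` are hypothetical illustrations.

```python
def lucas_kanade_window(prev, curr):
    """Estimate one (u, v) motion vector for a small grayscale window.

    prev and curr are equally sized 2-D lists of floats (rows of pixels).
    Fits Ix*u + Iy*v + It = 0 over the window in the least-squares sense.
    """
    h, w = len(prev), len(prev[0])
    a = b = c = d = e = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # horizontal gradient
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # vertical gradient
            it = curr[y][x] - prev[y][x]                  # change over time
            a += ix * ix
            b += ix * iy
            c += iy * iy
            d -= ix * it
            e -= iy * it
    det = a * c - b * b
    if abs(det) < 1e-9:
        raise ValueError("window has too little texture to estimate motion")
    # Solve the 2x2 normal equations [[a, b], [b, c]] [u, v]^T = [d, e]^T.
    return (c * d - b * e) / det, (a * e - b * d) / det


def make_frame(shift_x, size=8):
    """Hypothetical smooth test pattern, shifted right by shift_x pixels."""
    return [[float((x - shift_x) ** 2 + y ** 2) for x in range(size)]
            for y in range(size)]


u, v = lucas_kanade_window(make_frame(0), make_frame(1))
print(u, v)  # should be close to a one-pixel shift to the right
```

The estimate is only approximate because the method assumes small motion; practical systems usually rely on an image-pyramid variant from a library rather than a single window like this.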

Several well-known researchers propose using local descriptors, which describe an observation as a collection of local descriptors or patches, to solve the problem of low resolution. A local descriptor divides the whole image into a series of cells, computes a descriptor for every cell with some typical algorithm, and uses all of the cell descriptors together to describe the features of the whole image.
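
The cell idea can be sketched as follows. The text does not say which per-cell algorithm is used, so this sketch assumes a simple normalized intensity histogram per cell; the function name `cell_descriptor` and its parameters are hypothetical.

```python
def cell_descriptor(image, cell=4, bins=4, max_value=256):
    """Split a grayscale image into cell x cell blocks and concatenate
    one normalized intensity histogram per block."""
    h, w = len(image), len(image[0])
    descriptor = []
    for top in range(0, h - cell + 1, cell):
        for left in range(0, w - cell + 1, cell):
            hist = [0.0] * bins
            for y in range(top, top + cell):
                for x in range(left, left + cell):
                    idx = min(int(image[y][x]) * bins // max_value, bins - 1)
                    hist[idx] += 1.0
            total = sum(hist)  # equals cell * cell, so never zero
            descriptor.extend(value / total for value in hist)
    return descriptor


# Hypothetical 8x8 image: dark left half, bright right half.
image = [[0] * 4 + [255] * 4 for _ in range(8)]
descriptor = cell_descriptor(image)
print(len(descriptor))  # 4 cells x 4 bins = 16
```

Because every cell contributes its own histogram, the length of the final vector grows with the number of cells, which is where the high dimensionality of local descriptors comes from.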

How can the problem of inter-personal differences be solved? Many researchers have devoted their time and energy to this question and proposed many valuable ideas. Ronald, a famous researcher from England, proposes extracting many features from humans and summarizing them into a common feature; we then detect a human action by comparing the feature extracted from an unknown image with that common feature. How can a common feature be obtained from finite datasets? This is a key and difficult task in current human action recognition. In recent years, there have been two main methods. The first is direct classification, which detects the human action in an image using the distance between the image descriptor of an observed sequence and those of a sample sequence. This solution is so simple that it needs an enormous number of samples to guarantee that the result of a comparison is objective. The second is discriminative classifiers, which do not need unlimited samples and instead derive a mathematical formula from finite samples through some complex algorithm. The choice of algorithm determines the performance of a discriminative classifier. Many researchers have published papers in this field and brought human action recognition a big step forward.
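
The two approaches can be contrasted in a small sketch. I assume nearest-neighbor matching as an instance of direct classification and a perceptron as one very simple discriminative classifier; the text names neither, and the two-dimensional descriptors and the labels `WALK` and `JUMP` are made up for illustration.

```python
import math


def nearest_neighbor_label(query, samples):
    """Direct classification: return the label of the closest stored sample."""
    best_label, best_dist = None, math.inf
    for vector, label in samples:
        dist = math.dist(query, vector)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def train_perceptron(samples, epochs=20, lr=0.1):
    """Discriminative classifier: learn a formula sign(w . x + b)
    that separates labels +1 and -1."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, label in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:  # misclassified: nudge the boundary
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


WALK, JUMP = 1, -1  # hypothetical action labels
samples = [((1.0, 0.2), WALK), ((0.9, 0.1), WALK), ((1.1, 0.3), WALK),
           ((0.1, 1.0), JUMP), ((0.2, 0.9), JUMP), ((0.0, 1.1), JUMP)]
query = (1.0, 0.25)
w, b = train_perceptron(samples)
print(nearest_neighbor_label(query, samples), perceptron_predict(w, b, query))
```

The nearest-neighbor function must keep every sample and compare against all of them at query time, while the perceptron keeps only `w` and `b` once training is finished.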

Finally, I want to summarize the state of the art from three aspects. First, in general, global image descriptors have been proven to yield good results and can usually be extracted from normal video sequences at low cost. The future direction will be how to design better algorithms that extract more common features of the object. Second, the local descriptor is a good way to get more information from the region of interest (ROI), but it has the drawback that the dimension of the feature vector becomes so large that the complexity of the algorithm has increased in recent years; therefore, dimensionality reduction will be a future direction. Last, how to judge a human feature will always be a key problem. In my view, more researchers will devote their energy and time to this field.
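
As an illustration of dimensionality reduction, the sketch below finds the direction of largest variance with power iteration, the core step of principal component analysis, and projects each vector onto it. PCA is my assumption here since the text names no particular technique, and the function names and sample vectors are hypothetical.

```python
def first_principal_axis(vectors, iterations=100):
    """Return (mean, axis): the data mean and the unit direction of
    largest variance, found by power iteration on the covariance matrix."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    axis = [1.0] * dim  # start vector; assumed not orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break  # no variance along the current direction
        axis = [x / norm for x in nxt]
    return mean, axis


def project(vector, mean, axis):
    """Reduce one vector to a single coordinate along the axis."""
    return sum((vector[i] - mean[i]) * axis[i] for i in range(len(axis)))


# Hypothetical 3-D feature vectors that really vary along one direction only.
vectors = [(0.0, 0.0, 5.0), (1.0, 2.0, 5.0), (2.0, 4.0, 5.0), (3.0, 6.0, 5.0)]
mean, axis = first_principal_axis(vectors)
reduced = [project(v, mean, axis) for v in vectors]
print(axis, reduced)
```

Here three numbers per vector collapse to one without losing the ordering of the samples; real descriptors would keep several axes rather than one.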

      

After the first part of this paper, I think we now have a concept and an overall framework of human action recognition. Consequently, researchers have to face a new question: what can we do with this technology? Object tracking and human tracking have gradually come into view and become a hot topic in the video surveillance domain. In recent years, many algorithms have been presented by excellent researchers around the world. I am glad to introduce some classic algorithms in this paper in plain, easy-to-understand language.

Before human action recognition technology, background subtraction was a common approach for detecting the moving foreground and moving objects, but this approach only detects moving objects when the background is static, and it does not recognize what the object is in our test pictures and video surveillance datasets. For example, a moving car and a moving person can both be detected by this approach, and a group of people can be found by background subtraction. However, the technique does not count how many people are in the group. Recently, some researchers have applied human action recognition instead of background subtraction and obtained excellent results in actual tests.
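
A small sketch shows why background subtraction finds a group but cannot count its members. It assumes a fixed background image, a simple per-pixel threshold, and 4-connected blob counting; the helper `scene_with_people` and all pixel values are hypothetical.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels that differ from the static background by more than threshold."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def count_blobs(mask):
    """Count 4-connected foreground regions with an iterative flood fill."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                blobs += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return blobs


def scene_with_people(columns, height=6, width=10):
    """Hypothetical frame: gray background (10), each person a bright column (200)."""
    frame = [[10] * width for _ in range(height)]
    for col in columns:
        for row in range(1, 5):
            frame[row][col] = 200
    return frame


background = scene_with_people([])
apart = foreground_mask(scene_with_people([2, 6]), background)
touching = foreground_mask(scene_with_people([2, 3]), background)
print(count_blobs(apart), count_blobs(touching))  # prints "2 1"
```

Two people standing apart give two blobs, but as soon as they stand side by side the mask merges them into a single region, so the blob count no longer equals the number of people.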

Now we can recognize human action in a frame, and even count how many people are in the frame. However, how do we track this person or object into the next frame, that is, how do we find the same object in the next frame and show that it is the same one that appeared in the previous frame? Many researchers have turned their attention to this question.

HOG (Histogram of Oriented Gradients) is a popular algorithm that can recognize humans in an image accurately; therefore, many engineers have used the HOG algorithm to detect humans in every frame and judged whether objects from different frames are the same by the geometric distance between them. This is not a good way to track humans and objects in video, because the HOG algorithm is so complex that it takes a long time to recognize a human in a single picture. The bigger the image, the more time it costs to label the human. A normal video plays 25 frames per second, which leaves only 40 ms to process each frame. If we run the HOG algorithm on every frame to detect and track humans in a video, the delay becomes so long that the video no longer plays normally.
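
The "geometric distance" matching between frames can be sketched like this. I assume each detection is reduced to the center point of its bounding box and that pairs are matched greedily; the function name `match_detections`, the `max_distance` cutoff, and the coordinates are hypothetical.

```python
import math


def match_detections(previous, current, max_distance=50.0):
    """Pair each previous-frame center with the nearest unused current-frame
    center, ignoring candidates farther away than max_distance pixels.

    previous maps a track id to (x, y); current is a list of (x, y).
    Returns a dict from track id to an index into current.
    """
    matches = {}
    used = set()
    for track_id, (px, py) in previous.items():
        best, best_dist = None, max_distance
        for idx, (cx, cy) in enumerate(current):
            dist = math.hypot(cx - px, cy - py)
            if idx not in used and dist <= best_dist:
                best, best_dist = idx, dist
        if best is not None:
            matches[track_id] = best
            used.add(best)
    return matches


# Hypothetical detector output for two consecutive frames.
previous = {"A": (10, 10), "B": (100, 40)}
current = [(104, 42), (13, 11)]
print(match_detections(previous, current))  # {'A': 1, 'B': 0}
```

The matching itself is cheap; the cost in this scheme comes from running the detector on every full frame to produce `current` in the first place.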

The mean shift algorithm is a good way to locate an object that has been detected in a video and to track it through an iterative procedure. The difference between HOG (Histogram of Oriented Gradients) and the mean shift algorithm is that HOG uses the features of the entire picture to represent an object, whereas mean shift uses only some of the pixel features to represent the whole picture. This reduces the complexity and improves the speed of the tracking algorithm. In addition, with HOG the computer must detect the object or human across the whole picture in every frame, while mean shift only compares similar pixel features within the ROI (region of interest). In recent years, mean shift has been widely used in the tracking field, because it computes fast enough that a video can be shown at normal speed.
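
The iterative procedure can be sketched as below. I assume a per-pixel weight map in which larger values mean "looks more like the target" (in practice such a map would come from comparing pixel features with the target's model); the function name `mean_shift` and the toy weight map are hypothetical.

```python
def mean_shift(weights, window, max_iter=20):
    """Shift a (left, top, width, height) window toward the weighted
    centroid of the pixels it covers until it stops moving."""
    h, w = len(weights), len(weights[0])
    left, top, width, height = window
    for _ in range(max_iter):
        total = sum_x = sum_y = 0.0
        # Only pixels inside the current window (the ROI) are visited.
        for y in range(max(top, 0), min(top + height, h)):
            for x in range(max(left, 0), min(left + width, w)):
                wt = weights[y][x]
                total += wt
                sum_x += wt * x
                sum_y += wt * y
        if total == 0.0:
            break  # nothing target-like inside the window
        # Re-center the window on the weighted centroid.
        new_left = int(round(sum_x / total - (width - 1) / 2.0))
        new_top = int(round(sum_y / total - (height - 1) / 2.0))
        if (new_left, new_top) == (left, top):
            break  # converged
        left, top = new_left, new_top
    return left, top, width, height


# Hypothetical 10x10 weight map with a 3x3 target centered at x=7, y=6.
weights = [[0.0] * 10 for _ in range(10)]
for y in range(5, 8):
    for x in range(6, 9):
        weights[y][x] = 1.0
print(mean_shift(weights, (3, 2, 5, 5)))  # (5, 4, 5, 5)
```

Each iteration touches only the 25 pixels inside the window rather than the whole image, which is the source of the speed advantage described above.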

Now we have a good algorithm to detect humans and objects, and a fast algorithm to track a detected object. A new problem then appears: we cannot yet track multiple targets, although we can detect multiple objects with the HOG algorithm. Some researchers proposed a modified algorithm called the fast mean shift algorithm, which improves speed and reduces complexity in order to track multiple objects in a video. Why can the fast mean shift algorithm track multiple objects in a video? Because its procedure converges to the nearest mode or region (shrinking the ROI used for detection), which reduces the computation time. This allows us to run mean shift many times per frame in order to track more objects. Experiments indicate that the largest number of objects tracked with the fast mean shift algorithm is 21.
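
The "converges to the nearest mode" property is what lets several trackers run side by side. The sketch below uses plain mean shift on a set of 2-D points with a flat circular kernel, not the fast variant itself, whose details are not given here; the names `seek_mode` and `track_all`, the radius, and the point clusters are hypothetical.

```python
import math


def seek_mode(points, start, radius=3.0, max_iter=50, tol=1e-3):
    """Move start to the mean of the points within radius, repeatedly,
    until the position stops changing."""
    cx, cy = start
    for _ in range(max_iter):
        near = [(x, y) for x, y in points
                if math.hypot(x - cx, y - cy) <= radius]
        if not near:
            break  # no points in reach, stay where we are
        nx = sum(p[0] for p in near) / len(near)
        ny = sum(p[1] for p in near) / len(near)
        if math.hypot(nx - cx, ny - cy) < tol:
            break  # converged
        cx, cy = nx, ny
    return cx, cy


def track_all(points, starts, radius=3.0):
    """Run one independent mode search per target."""
    return [seek_mode(points, start, radius) for start in starts]


# Two hypothetical targets, each a small cluster of target-like pixels.
points = [(1, 2), (2, 1), (3, 2), (2, 3),
          (9, 10), (10, 9), (11, 10), (10, 11)]
print(track_all(points, [(3, 3), (9, 9)]))  # [(2.0, 2.0), (10.0, 10.0)]
```

Each search only looks at points within its own radius, so the two trackers do not disturb each other and the total cost grows roughly in proportion to the number of targets.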

Of course, many problems remain to be solved in the tracking domain. For example, how do we detect the object in a frame and reduce the false detection rate? How do we track multiple objects in a video? How do we reject interference from overlapping objects? How do we improve the speed of tracking algorithms? There are many research directions in the tracking field. I hope we students of Tianjin University can contribute our own achievements to this field in the future.
