MED for CREST-Deep

This repository contains code for developing a MED (Multimedia Event Detection) system.

The code is intended only for research purposes within CREST-Deep. You can download the code and run your own experiments freely.

The data required for developing the system is located on Tsubame. For details on accessing the data, please refer to Part I: Data.

If you have any questions or requests, please do not hesitate to contact us at mengxi at ks.cs.titech.ac.jp or ryamamot at ks.cs.titech.ac.jp.

You can find more information on running the baseline in the tutorial slides in slidesForMEDBaselineTutorial


CONTENT:

Part 0: Introduction to MED

Part I: Data

Part II: Evaluation

Part III: System Overview

Part IV: Frame Extraction

Part V: Deep Feature Extraction

Part VI: SVM Training and Testing

Part VII: LSTM Training and Testing


Part 0: Introduction to MED

Multimedia Event Detection (MED) is one of the tasks of TRECVID, a large-scale video information search and retrieval workshop hosted by NIST.

Video is becoming a new means of documenting everything from recipes to how to change a tire of a car. Ever expanding multimedia video content necessitates development of new technologies for retrieving relevant videos based solely on the audio and visual content of the video. Participating MED teams will create a system that quickly finds events in a large collection of search videos. -- http://www-nlpir.nist.gov/projects/tv2016/tv2016.html#med

MED System Overview

System

In this task, a system should find and rank the videos that contain a specified event within a large collection of videos. The event is specified by a textual description and a small number of example videos.

In contrast to event recognition, videos in the large collection may contain no or multiple events.


Part I: Data

We are required to use the given training video data to build a system that can judge whether an unknown video clip contains any of the following events:

20 Events for Classification

| Event ID | Event Name |
|----------|------------|
| E021 | Attempting_a_bike_trick |
| E022 | Cleaning_an_appliance |
| E023 | Dog_show |
| E024 | Giving_directions_to_a_location |
| E025 | Marriage_proposal |
| E026 | Renovating_a_home |
| E027 | Rock_climbing |
| E028 | Town_hall_meeting |
| E029 | Winning_a_race_without_a_vehicle |
| E030 | Working_on_a_metal_crafts_project |
| E031 | Beekeeping |
| E032 | Wedding_shower |
| E033 | Non-motorized_vehicle_repair |
| E034 | Fixing_musical_instrument |
| E035 | Horse_riding_competition |
| E036 | Felling_a_tree |
| E037 | Parking_a_vehicle |
| E038 | Playing_fetch |
| E039 | Tailgating |
| E040 | Tuning_musical_instrument |

A video clip may contain none of the events listed above. For a detailed description of each event, please refer to ***.

We place all the data needed for each module's input in Tsubame under the directory:

	/gs/hs0/tga-crest-deep/shinodaG

The full dataset is split into six parts:

	LDC2011E41_TEST (32060 videos)
	LDC2012E01 (2000 videos)
	LDC2012E110 (10899 videos)
	LDC2013E115 (1496 videos)
	LDC2013E56 (242 videos)
	LDC2014E16 (254 videos)

The parts that are involved in training are:

	LDC2011E41_TEST(portions of videos)
	LDC2012E01
	LDC2013E115

The parts that are involved in testing are:

	LDC2011E41_TEST(portions of videos) 
	LDC2012E110 
	LDC2013E56 
	LDC2014E16

The detailed split information for training and testing is contained in the csv annotation files. They are located in:

	/gs/hs0/tga-crest-deep/shinodaG/annotations/csv

For a detailed explanation of the csv annotations, please refer to Part II: Evaluation.

  • Video Data (Input for Frame Extraction)

    The video data is located in

	/gs/hs0/tga-crest-deep/shinodaG/video
They are compressed with H.264 and stored in .mp4 format.
  • Frame Data (Input for Deep Feature Extraction)

    The frame data is located in

	/gs/hs0/tga-crest-deep/shinodaG/frame
  • Feature Data (Input for SVM and LSTM)

    The feature data is located in

	/gs/hs0/tga-crest-deep/shinodaG/feature
We provide two kinds of features, i.e. `avgFeature` and `perFrameFeature`. 

`avgFeature` is one-vector feature for one video. It is calculated by taking the average over the deep features of the frames within the video.

`perFrameFeature` contains multiple feature vectors for one video. Each feature vector corresponds to one frame in the video. `perFrameFeature` is stored in h5 format, and every row in the h5 file is the vector of one frame. The rows follow temporal order, i.e. the first row corresponds to the first frame, the second row corresponds to the second frame...
  • Model Data
We place the caffe models under
	/gs/hs0/tga-crest-deep/shinodaG/models/caffeModels
The Deep Feature Extraction Module requires `googLeNet` of caffe, which is located in
	/gs/hs0/tga-crest-deep/shinodaG/models/caffeModels/imageShuffleNet
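
The relation between `perFrameFeature` and `avgFeature` described above can be sketched as follows. This is only an illustration in plain Python, assuming the per-frame vectors have already been loaded into a list of rows (one row per frame, in temporal order); the function name `average_features` and the toy 4-dimensional vectors are hypothetical, and the actual extraction code may compute the average differently.

```python
def average_features(per_frame_features):
    """Average equal-length per-frame vectors into a single video-level vector."""
    if not per_frame_features:
        raise ValueError("at least one frame vector is required")
    dim = len(per_frame_features[0])
    totals = [0.0] * dim
    for row in per_frame_features:
        if len(row) != dim:
            raise ValueError("all frame vectors must have the same length")
        for i, value in enumerate(row):
            totals[i] += value
    count = len(per_frame_features)
    return [total / count for total in totals]


# Three frames with toy 4-dimensional features (real deep features are much longer).
frames = [
    [1.0, 0.0, 2.0, 4.0],  # first frame
    [3.0, 0.0, 2.0, 0.0],  # second frame
    [2.0, 3.0, 2.0, 2.0],  # third frame
]
video_feature = average_features(frames)
print(video_feature)
```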

Part II: Evaluation

https://www.nist.gov/sites/default/files/documents/itl/iad/mig/MED16_Evaluation_Plan_V1.pdf

The performance of the system is evaluated using mAP (mean Average Precision).

Average Precision is often used for measuring the performance of an information retrieval system.

For a given target event, the testing video clips are listed from top to bottom according to their relevance scores with respect to the target event, which are given by the system.

From that list, we can calculate the precision and recall at different cut-off points. The Average Precision is then the area under the P-R (Precision-Recall) curve.

In practice, we approximate the area under the P-R curve using the following formula:

	AP = (1 / n) * (sum_k_from_1_to_n precision(k)),

where n is the number of target videos in the testing dataset, and precision(k) is the precision of the retrieval list cut off at the rank where the k-th target video appears.

Finally, mAP is calculated by taking the average of APs over all the events, which is used as the measurement of the system.
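
As a minimal sketch of the formula above, the following Python computes AP for one ranked list and mAP over several events. The function names and the toy rankings are hypothetical, and the sketch assumes that every target video of the test set appears somewhere in the ranked list, so that n equals the number of `True` entries. It is not the evaluation code shipped in this repository.

```python
def average_precision(ranked_labels):
    """AP of one ranked list.

    ranked_labels[i] is True if the video at rank i + 1 is a target video.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_target in enumerate(ranked_labels, start=1):
        if is_target:
            hits += 1
            # precision(k): fraction of targets in the list cut off at this rank
            precision_sum += hits / float(rank)
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_labels_per_event):
    """Mean of the per-event APs; the argument maps event ID to a ranked list."""
    aps = [average_precision(labels) for labels in ranked_labels_per_event.values()]
    return sum(aps) / len(aps)


# Toy rankings for two events (True marks a target video).
rankings = {
    "E021": [True, False, True, False],  # AP = (1/1 + 2/3) / 2
    "E022": [False, True],               # AP = (1/2) / 1
}
print(mean_average_precision(rankings))
```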

TRECVID provides us with the video information and annotations for training and testing in csv form:

EVENTS-BG_20160701_ClipMD.csv provides information about the background videos for training. These videos contain none of the 20 target events.

EVENTS-PS-100Ex_20160701_JudgementMD.csv provides annotations of positive and hard-negative video clips for training. Each row corresponds to a video, presenting the video ClipID, EventID and Instance_type. positive in the Instance_type indicates the video is positive, while miss indicates the video is hard-negative.

Kindred14-Test_20140428_ClipMD.csv provides information about the testing videos.

Kindred14-Test_20140428_Ref.csv provides annotations of the testing videos. Each row corresponds to a video-event combination and indicates whether the video contains the event. For example, "000069.E029","n" means that video 000069 does not contain event E029, while "996867.E037","y" means that video 996867 contains event E037.

Kindred14-Test_20140428_EventDB.csv provides the correspondence information of eventIds and eventNames.
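
The row format of Kindred14-Test_20140428_Ref.csv shown above can be read with a few lines of Python. This is a sketch under assumptions: it relies only on the two-column layout visible in the example rows ("clip.event" followed by "y" or "n"), silently skips any row that does not match (such as a possible header line), and the function name `parse_reference` is hypothetical.

```python
import csv
import io


def parse_reference(lines):
    """Return {(clip_id, event_id): bool} from rows like "000069.E029","n"."""
    labels = {}
    for row in csv.reader(lines):
        if len(row) < 2 or "." not in row[0]:
            continue  # skip blank or malformed rows
        clip_id, event_id = row[0].split(".", 1)
        flag = row[1].strip().lower()
        if flag not in ("y", "n"):
            continue  # skip a header row or an unknown flag
        labels[(clip_id, event_id)] = (flag == "y")
    return labels


# The two example rows quoted in the text, used here instead of the real file.
sample = io.StringIO(u'"000069.E029","n"\n"996867.E037","y"\n')
labels = parse_reference(sample)
print(labels)
```

To read the real file, pass an open file object instead of the in-memory sample.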

In addition to the above annotation files, we generate txt versions that are more convenient to use in the code for training an RNN. For a detailed explanation of the txt annotations, please refer to AnnotationProcess in Part VII: LSTM Training and Testing.


Part III: System Overview

The whole system mainly consists of four modules, namely Frame Extraction, Deep Feature Extraction, SVM Training and Testing and LSTM Training and Testing. Please refer to the following figure to understand the relations among these modules.

Modules

  • Frame Extraction: This module extracts frame images from videos. The input is a directory containing videos and, optionally, a list of videos. The output is a directory containing png images of video frames sampled every 2 seconds.

  • Deep Feature Extraction: This module extracts deep features from video frames. The input is the frames extracted from videos, and the output is the corresponding features. We use googLeNet in the baseline code, though it is easy to switch to another CNN.

  • SVM Training and Testing: This module trains and tests an SVM with deep features. The input is the annotations for the training and testing data and the deep features averaged over each video. The output is the detection results and the average precision of the system.

  • LSTM Training and Testing: This module builds an LSTM-based RNN for detecting events. The input is the features extracted from frames, and the output is an LSTM-based RNN model (training phase) or detection results (testing phase).

You can start your experiments from any module in this pipeline, since we have prepared the processed data for each module's input on Tsubame. For access to the data, please refer to Part I: Data.


Part IV: Frame Extraction

This module extracts frame images from videos. The input is a directory containing videos and, optionally, a list of videos. The output is a directory containing png images of video frames sampled every 2 seconds.

Requirements

  • ffmpeg - to extract frames from videos
    https://ffmpeg.org/
    Binaries stored in /work1/t2g-crest-deep/ShinodaLab/library/ffmpeg-3.2.4/bin/

Settings

  • videodir - (required) directory of videos

    Video files should be placed in the following format:

	${videodir}/${videoname}.mp4
  • outdir - (required) directory of frames

    Frames will be output with names in the following format:

	${outdir}/${videoname}/${videoname}_00000001.png
	${outdir}/${videoname}/${videoname}_00000002.png
  • list - (optional) list of videos

    This file should contain only file names but not paths as follows:

	hoge.mp4
	fuga.mp4
If `list` is not specified, every .mp4 file under `videodir` will be processed.

Run

./extractFrames.sh
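
For orientation, the sketch below builds one ffmpeg command that would produce output in the naming format described above. It is an assumption about what such a call could look like, not the content of extractFrames.sh: the helper name `build_ffmpeg_command` is hypothetical, and the `fps=1/2` video filter is used here as one standard way to keep one frame every 2 seconds. The command is only constructed, not executed.

```python
import os


def build_ffmpeg_command(videodir, outdir, video_file, ffmpeg="ffmpeg"):
    """Build an ffmpeg argument list that samples one frame every 2 seconds."""
    videoname = os.path.splitext(video_file)[0]
    frame_dir = os.path.join(outdir, videoname)
    # %08d is expanded by ffmpeg to 00000001, 00000002, ...
    pattern = os.path.join(frame_dir, videoname + "_%08d.png")
    return [
        ffmpeg,
        "-i", os.path.join(videodir, video_file),
        "-vf", "fps=1/2",  # output frame rate of 0.5 fps, i.e. one frame per 2 seconds
        pattern,
    ]


cmd = build_ffmpeg_command("/path/to/video", "/path/to/frame", "hoge.mp4")
print(" ".join(cmd))
```

The per-video output directory has to exist before ffmpeg is run, since ffmpeg does not create it.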

Part V: Deep Feature Extraction

This module is for extracting the deep features from video frames. The input is the frames extracted from videos, and the output is the corresponding features.

This module is written in Python and depends on:

-	Python 2.7
-	Caffe

If you are on Tsubame, you can easily load these dependencies by executing the following:

	source /usr/apps.sp3/nosupport/gsic/env/caffe-0.13.sh
	source /usr/apps.sp3/nosupport/gsic/env/python-2.7.7.sh

To run the code for extracting the features, please edit the variables in 'extractDeepFeaturesStarter.sh', and run:

	./extractDeepFeaturesStarter.sh

It extracts both kinds of deep features, avgFeature and perFrameFeature. For the explanation of avgFeature and perFrameFeature, please refer to Part I: Data.

Please refer to the script extractDeepFeaturesStarter.sh for the configuration of variables.

Note: The code currently only supports extracting the deep feature from the pool5/7x7_s1 layer of googLeNet.


Part VI: SVM Training and Testing

This module trains and tests an SVM with deep features. The input is the annotations for the training and testing data and the deep features averaged over each video. The output is the detection results and the average precision of the system.

Requirements

Settings

  • EXPID - (required) name of the experiment
  • TempOutDir - (required) temporary directory
  • LIBSVM - (required) Location of LIBSVM
  • IS_LINEAR - (required) SVM kernel type
    • 0 - Use RBF kernel
    • 1 - Use linear kernel
  • SVSUFFIX - (required) suffix of feature file name
  • ANNOT_DIR - (required) directory where annotation files are saved
  • TEST_DATA - (required) prefix of annotation files for testing
  • BG_DATA - (required) prefix of annotation files for training background data
  • TRAIN_DATA - (required) prefix of annotation files for training positive data
  • TEST_SVDIR - (required) directories where feature files for testing are saved
  • TRAIN_SVDIR - (required) directories where feature files for training are saved

Run

./svm.sh

Outputs

  • ${EXPID}/${EXPID}.detection.csv - detection results
  • ${EXPID}/ap.csv - average precision of each event and their mean

You should obtain an mAP of 0.512 on the test set.


Part VII: LSTM Training and Testing

This module aims to build an LSTM-based RNN for detecting events. The input is the features extracted from frames, and the output is an LSTM-based RNN model (training phase), or detection results (testing phase).

This module is divided into three parts: AnnotationProcess, Lstm and ResultEvaluate.

AnnotationProcess processes the csv annotations and converts them into txt annotations, which are used as part of the input to Lstm.

AnnotationProcess is written in C/C++.

To compile the C/C++ code of AnnotationProcess, simply run:

	./compile.sh

It will give you an executable named convertCsvToTxt.

Editing the variables in the script convertCsvToTxt.sh and running:

	./convertCsvToTxt.sh

will give you the txt annotations in the place that you specify in the script.

The output variables include:

TRAIN_TXT_PATH, TEST_TXT_PATH: the txt annotation files of training and testing

Each row in the txt annotation file corresponds to a video clip. It includes the path of the feature file of the video and a label index in the range [1, 21]. Labels 1 to 20 correspond to the event IDs E021 to E040, and 21 is the background label, indicating that the video contains none of the target events.

NEW_TEST_REF_FILE: the test 'csv' annotation file, which will be used for evaluation in ResultEvaluate. The format is the same as the test csv file of input.
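
The label convention of the txt annotations can be illustrated with a short Python sketch. The mapping between labels and event IDs follows the description above; everything else is an assumption made for illustration: the helper names are hypothetical, the example path is invented, and the sketch assumes that the feature path and the label are separated by whitespace, which the actual files may not do.

```python
BACKGROUND_LABEL = 21  # label for clips that contain none of the target events


def event_id_to_label(event_id):
    """Map 'E021'..'E040' to the label indices 1..20."""
    number = int(event_id.lstrip("E"))
    if not 21 <= number <= 40:
        raise ValueError("unknown event ID: %s" % event_id)
    return number - 20


def label_to_event_id(label):
    """Map 1..20 back to 'E021'..'E040'; the background label maps to None."""
    if label == BACKGROUND_LABEL:
        return None
    if not 1 <= label <= 20:
        raise ValueError("label out of range: %d" % label)
    return "E%03d" % (label + 20)


def parse_annotation_line(line):
    """Split one annotation row into (feature_path, label).

    Assumes the label is the last whitespace-separated field of the row.
    """
    path, label = line.rsplit(None, 1)
    return path, int(label)


# Hypothetical row: a feature file path followed by the label of event E029.
path, label = parse_annotation_line("/path/to/perFrameFeature/000069.h5 9\n")
print(path, label, label_to_event_id(label))
```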

Lstm is written in Lua and depends on:

-	Torch

We have installed the Torch framework under:

	/gs/hs0/tga-crest-deep/shinodaG/library/torch/torch-master_17.8.8

If you are on Tsubame, you can easily load the framework into your environment by executing:

	source /gs/hs0/tga-crest-deep/shinodaG/library/env/torch.sh

Before training your own LSTM model, edit the variables in trainStarter.sh:

TRAIN_ANNOTATION_PATH (input): the training txt annotation file you create in AnnotationProcess

MODEL_SAVING_DIR (output): the directory you would like to save your models to.

To start training, run:

	./trainStarter.sh

If you are on Tsubame3, you can instead submit a job by first editing the submission script (e.g. modifying the -o and -e options) and then running:

	qsub -g YOUR_GROUP_NAME submitLstmTrain.sh

Training should complete within an hour.

To use your trained LSTM model for testing, edit the variables in testStarter.sh:

TEST_ANNOTATION_PATH (input): the test txt annotation file you create in AnnotationProcess

and edit the variables in testStarterBatch.sh:

MODEL_DIR (input): the directory containing your trained models.

OUTPUT_DIR (output): the directory you want to store the softmax probabilities of test data in.

and then run

	./testStarterBatch.sh

If you are on Tsubame3, you can instead submit a job by first editing the submission script (e.g. modifying the -o and -e options) and then running:

	qsub -g YOUR_GROUP_NAME submitLstmTest.sh

Testing should complete within 20 minutes.

ResultEvaluate is written in Python and bash. It depends on:

- Python 2.7

To get the final AP (Average Precision) for the detection result, edit the variables in the script evaluateStarter.sh:

H5_SOFTMAX_DIR (input): The directory containing the 'h5' result files to be evaluated, which are output by 'lstm_test'

TEST_REF (input): the test csv file you create in AnnotationProcess

OUTPUT_AP_DIR (output): The directory to store mAP (mean Average Precision) scores

and then run:

	./evaluateStarter.sh

The AP performance will be written to the place that you specify (OUTPUT_AP_DIR) in the script.
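
To connect the softmax outputs with the ranked lists used for AP, the following sketch ranks clips for one event by their softmax probability. It is an assumption about how the probabilities can serve as relevance scores, not a description of what ResultEvaluate does internally. The function name, the clip ID 123456 and the toy 3-class probabilities are hypothetical (the real output has 21 classes).

```python
def rank_clips_for_event(softmax_by_clip, label):
    """Return clip IDs sorted by descending probability of the 1-based label."""
    return sorted(
        softmax_by_clip,
        key=lambda clip_id: softmax_by_clip[clip_id][label - 1],
        reverse=True,
    )


# Toy softmax outputs with 3 classes per clip.
softmax_by_clip = {
    "000069": [0.1, 0.7, 0.2],
    "996867": [0.6, 0.3, 0.1],
    "123456": [0.3, 0.3, 0.4],
}
ranking = rank_clips_for_event(softmax_by_clip, label=1)
print(ranking)
```

Comparing such a ranking with the reference annotations gives the list of target and non-target videos from which AP is computed, as explained in Part II: Evaluation.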

With the following parameters for training the Lstm, you should obtain an mAP of around 0.43 on the test set.

	EPOCH_NUM=40
	BATCH_SIZE=128
	HIDDEN_UNIT=256
	LEARNING_RATE=0.001
	LEARNING_RATE_DECAY=1E-4
	WEIGHT_DECAY=0.01
	GRADIENT_CLIP=5