
[Reinforcement Learning] reinforcement learning for anatomical landmark detection

HJChung 2020. 11. 18. 11:21

Alansary, Amir, et al. "Evaluating reinforcement learning agents for anatomical landmark detection." Medical image analysis 53 (2019): 156-164.

 

In medical imaging, manual landmark annotation is time-consuming and error-prone, so automatic methods have been developed to tackle this problem.

They formulate the landmark detection problem as a sequential decision-making process of a goal-oriented agent navigating an environment (the acquired image) towards a target landmark.

 

One of the main advantages of applying RL to the landmark detection problem is the ability to learn simultaneously both a search strategy and the appearance of the object of interest as a unified behavioral task for an artificial agent.

About Reinforcement Learning (RL)

Summarizing what I understood... 1. What is Reinforcement Learning?

Instead of learning from a static dataset that requires an almost unlimited number of (data, output) pairs, the idea of reinforcement learning is to learn a policy that executes an action at every step, receives a reward, and ultimately maximizes the sum of all rewards (more precisely, the expectation of that sum).
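Written out in standard notation (my own addition for reference; γ is a discount factor not mentioned in the sentence above), the quantity the policy is trained to maximize is the expected (discounted) return:

```latex
% Expected discounted return; the policy \pi is learned to maximize this quantity.
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right], \qquad 0 < \gamma \le 1
```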

 

A problem with the Markov property, where the best decision can be made from the current state alone without referring to previous states, is called a Markov Decision Process (MDP). Defining the RL problem as an MDP, in which the optimal actions that maximize future reward are chosen using only the information in the current state, makes the solution much simpler; so when we talk about solving an RL problem, we usually mean an MDP, under the premise that the Markov property holds.
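In the usual notation (again my own addition), an MDP is a tuple of states, actions, transition probabilities, rewards and a discount factor, and the Markov property says the next state depends only on the current state and action:

```latex
% MDP components and the Markov property (standard definitions).
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, \ldots, s_t, a_t)
```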

Essentially, a reinforcement learning algorithm builds a software Agent that acts in some Environment.

※ Anything that has states and actions and produces rewards can serve as an Environment.

The Agent receives the Environment's current State as input.

※ A State is the set of all data about the Environment at a particular time step t.

The Agent then uses this State information to take an Action.

※ The algorithm the Agent uses to decide its Action is called the Policy.

This Action changes the Environment deterministically or stochastically, so the Environment's current state (st) changes to a new state (st+1).

When the Agent takes Action at in State st and the Environment transitions to the next State st+1, the Agent receives a corresponding Reward.

The Agent's goal is to learn Actions that maximize the long-term expected Reward.
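A minimal Python sketch of this State → Action → Reward loop, assuming hypothetical env/agent interfaces (not the paper's code), just to make the interaction concrete:

```python
# Minimal agent-environment interaction loop (illustrative only; interfaces are hypothetical).
def run_episode(env, agent, max_steps=100):
    state = env.reset()                                   # current State s_t
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # Policy: choose Action a_t given s_t
        next_state, reward, done = env.step(action)       # Environment transitions to s_{t+1}
        agent.observe(state, action, reward, next_state)  # learn from the received Reward
        total_reward += reward
        state = next_state
        if done:                                          # terminal state reached
            break
    return total_reward
```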

How can we learn a policy that maximizes the sum of all rewards (more precisely, its expectation)?

 

Value Function

The action-value function (Qπ) maps a state-action pair (s, a) to the expected return obtained by taking action a in state s and then following policy π. This function is called the Q function, and its value is called the Q value.
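In equation form (standard definition, added for reference):

```latex
% Action-value (Q) function: expected discounted return after taking a in s and following \pi.
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s,\ a_t = a\right]
```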

Q-learning

At first, however, the agent knows nothing about the transition probabilities or the rewards.

To learn about the rewards it must experience each state and transition at least once, and to obtain reliable estimates of the transition probabilities it must experience them many times.

Temporal Difference (TD) learning is a modification of the value iteration algorithm designed for cases like this, where the agent knows only part of the MDP; similarly, Q-learning is the Q-value iteration algorithm adapted to the situation where the transition probabilities and rewards are initially unknown.

Q-learning works by watching an agent explore (for example, under a random policy) and gradually improving its estimates of the Q-values.
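After observing a transition (s, a, r, s′), the standard tabular update is (α is the learning rate):

```latex
% Q-learning update: move Q(s,a) towards the TD target r + gamma * max_a' Q(s',a').
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```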

 

Deep Q-learning

The DNN used to estimate Q-values is called a Deep Q-Network (DQN), and performing approximate Q-learning with a DQN is called Deep Q-learning.
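As a rough sketch (a generic PyTorch-style DQN loss under my own assumptions, not the paper's exact training code), the network is trained to match the TD target:

```python
# Generic DQN TD-target and loss (sketch; assumes batches sampled from a replay buffer).
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # TD target: r + gamma * max_a' Q(s', a'; theta_target); no bootstrapping past terminal states
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    return nn.functional.mse_loss(q_values, targets)
```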

Reinforcement learning for landmark detection in medical images

  • Goal: an artificial agent learns to make a sequence of decisions towards the target anatomical landmark.
  • Environment: 3D medical image
  • State: each state defines a 3D Region of Interest (ROI) centered around the agent's current position as it searches for the target landmark. 
  • Agent: artificial agent
  • Action: six actions [left, right, up, down, forward, backward], i.e. steps {±ax, ±ay, ±az} in the positive or negative direction of the x, y or z axis. 
  • DQN: a DQN-based network architecture for anatomical landmark detection. The CNN outputs a Q-value for each action, and the best action is selected as the one with the highest Q-value (a rough navigation sketch follows after this list).

  • Reward: whether the Agent is moving closer to or farther away from the target landmark (written out as a difference of distances just below)
    • D: the Euclidean distance between two points, evaluated at the previous step and at the current step relative to the target
    • Pi: the current predicted landmark position at step i (the agent's position in the current state)
    • Pt: the target ground-truth landmark location (the actual landmark position)
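With these definitions, the reward at step i is (as I understand it) the decrease in distance to the target, so it is positive when the agent moves closer to Pt and negative when it moves away:

```latex
% Reward: difference of Euclidean distances to the target before and after the move.
R_i = D(P_{i-1}, P_t) - D(P_i, P_t)
```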

 

  • Terminal state: during training, the terminal state is reached when the distance between the current point of interest and the target landmark Pt is less than or equal to 1 mm.
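Putting these pieces together, greedy inference might look roughly like the sketch below (hypothetical helper names such as crop_roi, an assumed ROI size, a fixed one-voxel step, and a simple step budget; the paper additionally explores multi-scale search strategies):

```python
import numpy as np

# Six fixed-step actions: +/- steps along the x, y and z axes (one-voxel step assumed here).
ACTIONS = np.array([[ 1, 0, 0], [-1, 0, 0],
                    [ 0, 1, 0], [ 0, -1, 0],
                    [ 0, 0, 1], [ 0, 0, -1]])

def detect_landmark(volume, q_network, start_pos, roi_size=(45, 45, 45), max_steps=200):
    """Greedy navigation: crop an ROI around the current position, take the argmax-Q action, move."""
    pos = np.array(start_pos)
    for _ in range(max_steps):
        roi = crop_roi(volume, pos, roi_size)   # state: ROI centred on the agent (hypothetical helper)
        q_values = q_network(roi)               # one Q-value per action
        pos = pos + ACTIONS[int(np.argmax(q_values))]
    return pos                                  # predicted landmark location
```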


To Study..

Related Work

  • Ghesu et al. (2016) adopted a deep RL-agent to navigate in a 3D image with fixed step actions for automatic landmark detection. The artificial agent tries to learn the optimal path from any location to the target point by maximizing the accumulated rewards of taking sequential action steps.
  • Xu et al. (2017), inspired by Ghesu et al. (2016), proposed a supervised method for action classification using image partitioning. Their model learns to extract an action map for each pixel of the input image across the whole image into directional classes towards the target point. They use a fully convolutional network (FCN) with a large receptive field to capture rich contextual information from the whole image. Their method achieves better results than using an RL agent; however, it is restricted to 2D or small-sized 3D images due to the computational complexity of 3D CNNs.
  • In order to overcome this additional computational cost, Li et al. (2018) and Noothout et al. (2018) presented a patch-based iterative CNN to detect individual or multiple landmarks simultaneously.
  • Furthermore, Ghesu et al. (2017) and Ghesu et al. (2019) extended their RL-based landmark detection approach to exploit multi-scale image representations.

RL has also been applied to several other medical imaging applications.


Reference

github.com/amiralansary/rl-medical/blob/master/examples/LandmarkDetection/SingleAgent/doc/midl_2018.pdf

www.sciencedirect.com/science/article/pii/S1361841518306121?casa_token=koo1U9Cl87QAAAAA:Q2WdkAOjsUfKPlvcZhZSJG30D8qlgzgHSRyZe0kjGrn_TSvSc1JL-nSrJp-Gh6EpeGLUfYK5WIY#fig0004

sanghyukchun.github.io/90/

www.slideshare.net/ssuser06e0c5/q-learning-cnn-object-localization