regex - Bash script to parse files for multiple occurrences of a string between a pattern


I am doing a little text processing to find video content in HTML files uploaded by users. I have defined a tag called "video", and users are supposed to put their video files in it like this:

<video> abcd.mp4 </video> 

Presently I am using awk to extract the lines that have the video tag:

str=$(awk '/<video>/{flag=1;} /<\/video>/{print ;flag=0} flag { print }' file.html) 

The output contains the tags too, so I remove the prefix and suffix to get the video file name. I have done it like this:

prefix="<video>"
suffix="</video>"
foo=${str#$prefix}
foo=${foo%$suffix}

But this only works for files that use the video tag once. For files with multiple uses of the tag, the string returned by awk starts at the first occurrence of <video> and runs until the last occurrence of </video>.

So my question is: how should I write the script so that at the end it gives me an array of the strings between the <video> and </video> tags? Also, how can I change

<video> abcd.mp4 </video> 

to

<media> abcd.mp4 </media>. 

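To get each tag itself (POSIX extended regular expressions have no lazy .+?, so a negated character class is used to stop the match at the next <):

grep -Eo "<video>[^<]+</video>" myfile.html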

To get just the text within the tags:

grep -Eo "<video>[^<]+</video>" myfile.html | sed -E "s|</?video>||g"

If the opening and closing tags are on different lines:

tr "\n" " " < myfile.html | grep -Eo "<video>[^<]+</video>" | sed -E "s|</?video>||g"

Example input:

this <video> video1.mp4 </video>  file <other> <random> </tags> <media> media1.mp4 </media>  <video> video2.mp4 </video>  <media>     media 2 spaces  , on  multiple lines.mp4 </media> 

Example output:

 video1.mp4 
 video2.mp4 
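Since the question asks for an array, the lines printed above can be collected into one. The following is only a minimal sketch: it assumes bash 4 or later (for mapfile) and file names that never contain a < character, and the function name extract_videos and the sample file are made up for illustration. It also trims the whitespace around each name.

```shell
#!/usr/bin/env bash
# Sketch: collect the text between <video> and </video> into a bash array.
# Assumes bash 4+ (mapfile) and that file names never contain '<'.

# extract_videos is a hypothetical helper name, not from the original post.
extract_videos() {
    tr '\n' ' ' < "$1" |
        grep -Eo '<video>[^<]+</video>' |
        sed -E 's|</?video>||g; s/^[[:space:]]+//; s/[[:space:]]+$//'
}

# Illustrative sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
    'this <video> video1.mp4 </video>' \
    'file <other> <random> </tags>' \
    '<media> media1.mp4 </media>' \
    '<video> video2.mp4 </video>' > "$sample"

# One array element per match; -t drops the trailing newline of each line.
mapfile -t videos < <(extract_videos "$sample")
rm -f "$sample"

printf 'found %d videos\n' "${#videos[@]}"
for v in "${videos[@]}"; do
    printf '[%s]\n' "$v"
done
```

Reading line by line with mapfile keeps names that contain spaces in a single element, which an unquoted videos=( $(...) ) assignment would split apart.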

To get both video and media tags (please specify this in your original question):

tr "\n" " " < vid.html | grep -Eo "<(video|media)>[^<]+</(video|media)>" | sed -E "s#</?(video|media)>##g"

Output:

 video1.mp4 
 media1.mp4 
 video2.mp4 
 media 2 spaces      , on      multiple lines.mp4 

For your second question, run the whole file through this command:

sed -E "s|(</?)video>|\1media>|g" vid.html
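The command above prints the converted text to standard output and leaves vid.html untouched. As a rough sketch, the result can be redirected to a new file, which avoids relying on sed -i (its syntax differs between GNU and BSD sed). The function name rename_video_tags and the temporary files below are hypothetical and only there to make the example runnable.

```shell
#!/usr/bin/env bash
# Sketch: rename <video>/</video> to <media>/</media>, writing to a new file.

# rename_video_tags is a hypothetical helper name.
rename_video_tags() {
    # The group captures "<" or "</", so one expression handles both tags.
    sed -E 's|(</?)video>|\1media>|g' "$1"
}

# Illustrative input and output files.
src=$(mktemp)
dst=$(mktemp)
printf '%s\n' 'a <video> abcd.mp4 </video> b' > "$src"

rename_video_tags "$src" > "$dst"
result=$(cat "$dst")
rm -f "$src" "$dst"

printf '%s\n' "$result"
```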
