regex - Bash script: parse files for multiple occurrences of a string between patterns
I am doing a little text processing to find video content in HTML files uploaded by users. I have defined a tag called "video", and users are supposed to put video files in it like this:
<video> abcd.mp4 </video>
Presently I am using awk to extract the line that has the video tag:
str=$(awk '/<video>/{flag=1;} /<\/video>/{print ;flag=0} flag { print }' file.html)
The output contains the tag too, so I remove the prefix and suffix to get the video file name. I have done it like this:
prefix="<video>"
suffix="</video>"
foo=${str#$prefix}
foo=${foo%$suffix}
But this works only for files in which the video tags are used once. For files with multiple uses of the tags, the string returned by awk starts at the first occurrence of <video> and runs to the last occurrence of </video>.
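To make the failure concrete, here is a minimal sketch that reproduces it. The two-tag sample is hypothetical and stands in for file.html; everything else follows the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical sample with two <video> tags on separate lines.
sample=$'<video> a.mp4 </video>\n<video> b.mp4 </video>'

# Same awk program as above, reading the sample from stdin.
str=$(printf '%s\n' "$sample" |
  awk '/<video>/{flag=1;} /<\/video>/{print ;flag=0} flag { print }')

prefix="<video>"
suffix="</video>"
foo=${str#"$prefix"}   # removes only the very first <video>
foo=${foo%"$suffix"}   # removes only the very last </video>

# The inner </video> and <video> are still present in $foo.
printf '%s\n' "$foo"
```

The parameter expansions strip one prefix and one suffix from the whole string, so every tag between the first opening tag and the last closing tag survives.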
So my question is: how should I write the script so that, at the end, it gives me an array of the strings between the <video> and </video> tags? Also, how can I change
<video> abcd.mp4 </video>
to
<media> abcd.mp4 </media>?
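To get each tag itself:
grep -Eo "<video>(.+?)</video>" myfile.html
The non-greedy .+? is not part of POSIX extended regular expressions. If your grep treats it as greedy (GNU grep -E does), use grep -oP instead.
To get the text within the tags:
grep -Eo "<video>(.+?)</video>" myfile.html | sed -E "s|</?video>||g"
If the opening and closing tags are on different lines:
tr "\n" " " < myfile.html | grep -Eo "<video>(.+?)</video>" | sed -E "s|</?video>||g"
Example input: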
this <video> video1.mp4 </video> file <other> <random> </tags> <media> media1.mp4 </media> <video> video2.mp4 </video> <media> media 2 spaces , on multiple lines.mp4 </media>
Example output:
video1.mp4
video2.mp4
To get both video and media tags (please specify this in the original question):
tr "\n" " " < vid.html | grep -Eo "<(video|media)>(.+?)</(video|media)>" | sed -E "s#</?(video|media)>##g"
Output:
video1.mp4
media1.mp4
video2.mp4
media 2 spaces , on multiple lines.mp4
For the second question, run the whole file through this command:
sed -E "s|(</?)video>|\1media>|g" vid.html
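The question also asks for the results as an array. The following is a minimal sketch of one way to do that, not a definitive implementation. It assumes bash 4 or newer (for mapfile) and that file names never contain a "<" character. The sample text and the array name videos are hypothetical. It swaps the lazy .+? for [^<]*, which behaves the same under any grep that supports -E and -o.

```shell
#!/usr/bin/env bash
# Hypothetical sample standing in for myfile.html.
html=$'this <video> video1.mp4 </video> file\n<media> media1.mp4 </media>\n<video>\nvideo2.mp4\n</video>'

# 1. tr joins the lines so a tag pair split across lines can still match.
# 2. [^<]* stops at the next "<", so each pair is matched separately
#    without a lazy quantifier (assumes no "<" inside file names).
# 3. sed removes the tags, then trims leading and trailing whitespace.
# mapfile -t stores one array element per output line (bash 4+).
mapfile -t videos < <(
  printf '%s\n' "$html" |
    tr '\n' ' ' |
    grep -Eo '<video>[^<]*</video>' |
    sed -E 's|</?video>||g; s/^[[:space:]]+//; s/[[:space:]]+$//'
)

printf 'found %d video(s)\n' "${#videos[@]}"
printf '%s\n' "${videos[@]}"
```

Because mapfile splits on newlines only, a file name containing spaces stays in a single array element. Loop over the results with for v in "${videos[@]}"; do ...; done.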