Remove useless UTF symbols


If you want to save a file without special symbols in its name, but don't want to gsub each symbol out individually, you might consider using this function:

-- Returns str with every multi-byte (non-ASCII) UTF-8 character removed
function getUTFlessString(str)
  if type(str) ~= "string" then return end
  local arr = UTF8ToCharArray(str)
  local ln = { }
  -- ipairs guarantees the characters are visited in their original order
  for _, v in ipairs(arr) do
    -- keep only single-byte (ASCII) characters
    if #v <= 1 then
      table.insert(ln, v)
    end
  end
  return table.concat(ln, "")
end
  
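-- Splits a UTF-8 string into a table with one entry per character
function UTF8ToCharArray(str)
  local charArray = {}
  local iStart = 0 -- start index of the multi-byte character being read, 0 if none
  local strLen = str:len()
  -- value of the b-th bit, counting from 1
  local function bit(b)
    return 2 ^ (b - 1)
  end

  -- true if bit value b (a power of two) is set in w
  local function hasbit(w, b)
    return w % (b + b) >= b
  end
  -- stores the pending multi-byte character, which ends at i - 1
  local function checkMultiByte(i)
    if (iStart ~= 0) then
      charArray[#charArray + 1] = str:sub(iStart, i - 1)
      iStart = 0
    end
  end
  for i = 1, strLen do
    local b = str:byte(i)
    local multiStart = hasbit(b, bit(7)) and hasbit(b, bit(8)) -- 11xxxxxx: lead byte
    local multiTrail = not hasbit(b, bit(7)) and hasbit(b, bit(8)) -- 10xxxxxx: continuation byte

    if (multiStart) then
      checkMultiByte(i)
      iStart = i
    elseif (not multiTrail) then
      checkMultiByte(i)
      charArray[#charArray + 1] = str:sub(i, i)
    end
  end
  checkMultiByte(strLen + 1)
  return charArray
end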

The usage is simple:

local c = "[DM] Packy Ѡ vol.13 Ѡ Defiance Ѡ" 
local dc = getUTFlessString(c) 
-- the result will be "[DM] Packy  vol.13  Defiance " 
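
For anyone who only needs the splitting step, here is a minimal sketch of the same idea using a plain Lua pattern instead of the bit checks. It assumes the input is valid UTF-8, and splitUtf8 is a made-up name for illustration, not an MTA function. The pattern takes one byte that is not a continuation byte (10xxxxxx, i.e. 128-191), followed by any continuation bytes that belong to it.

```lua
-- splitUtf8 is a hypothetical helper name, not part of MTA or of the post above.
-- Assumes str is valid UTF-8.
function splitUtf8(str)
  local chars = {}
  -- one non-continuation byte, then zero or more continuation bytes (128-191)
  for ch in str:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end
```

Each entry in the returned table is one character, so single-byte entries are plain ASCII and longer entries are multi-byte symbols.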

  • 2 weeks later...

One small point: your hasbit function isn't necessary. MTA has built-in support for bit manipulation. Both bitAnd and bitTest will do that job, the latter being just a wrapper for bitAnd ( ... ) ~= 0.
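
To illustrate the point, here is a rough sketch of classifying a byte with bitAnd instead of hasbit. classifyByte is a made-up name, and the fallback function is only there so the snippet also runs in a plain Lua interpreter where MTA's bitAnd does not exist.

```lua
-- Use MTA's bitAnd when available; otherwise fall back to a small
-- pure-Lua AND for non-negative integers (assumption: only needed outside MTA).
local band = bitAnd or function(a, b)
  local result, bitValue = 0, 1
  while a > 0 and b > 0 do
    if a % 2 == 1 and b % 2 == 1 then
      result = result + bitValue
    end
    a, b = math.floor(a / 2), math.floor(b / 2)
    bitValue = bitValue * 2
  end
  return result
end

-- classifyByte is a hypothetical helper name.
function classifyByte(b)
  if band(b, 0x80) == 0 then
    return "ascii" -- 0xxxxxxx: single-byte character
  elseif band(b, 0xC0) == 0x80 then
    return "trail" -- 10xxxxxx: continuation byte
  else
    return "lead"  -- 11xxxxxx: first byte of a multi-byte character
  end
end
```

The "lead" and "trail" cases correspond to multiStart and multiTrail in UTF8ToCharArray.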

I wrote a smaller alternative to your function:

function removeMultiByteChars ( str ) 
  
    local asciiStr = "" 
  
    for i = 1, utfLen ( str ) do 
        local c = utfSub ( str, i, i ) 
  
        if not bitTest ( 0x80, string.byte ( c ) ) then 
            asciiStr = asciiStr .. c 
        end 
    end 
  
    return asciiStr 
  
end 

It can be used in the same way as yours:

removeMultiByteChars ( "[DM] Packy Ѡ vol.13 Ѡ Defiance Ѡ" ) 
-- this should return "[DM] Packy  vol.13  Defiance "
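
Since every byte of a multi-byte UTF-8 character has its high bit set (values 128-255), the same result can probably be had with a single gsub over that byte range. This is only a sketch that relies on plain Lua string patterns rather than any MTA function, and stripNonAscii is a made-up name:

```lua
-- stripNonAscii is a hypothetical helper name, not an MTA function.
-- Removes every byte in the range 128-255, i.e. all bytes that belong
-- to multi-byte UTF-8 characters.
function stripNonAscii(str)
  -- the parentheses drop gsub's second return value (the match count)
  return (str:gsub("[\128-\255]", ""))
end
```

Unlike the functions above, this works byte by byte, so it also strips stray high bytes from strings that are not valid UTF-8.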
